All,
Are there currently any solutions in Julia to handle larger-than-memory datasets in a way similar to working with a DataFrame?

The reason I'm asking is that R has the limitation that you need to fit all your data into memory. On the other hand, SAS (while being quite different) does not have this limitation.

In the age of "big data" this can be quite an advantage.

Of course, you can "patch" this situation, e.g. in R you can use the ff or bigmemory packages, or use SQL.

But my point is that it is bolted on, and you need to spend extra mental loops switching between, say, data.frame and ff, instead of focusing on your data problem at hand. This is a clear advantage of SAS, where you don't have to do that. So I'm wondering how this is handled in Julia.

Thanks,

M

P.S.: I do not intend to start a flame war, e.g. whether R or SAS or Julia is better. I'm just interested to find out whether such a solution exists in Julia (I haven't found any, but maybe I overlooked something). And if no such solution exists, given that Julia is still young, evolving, and malleable (in a positive sense), it might make sense to think about it.

--
You received this message because you are subscribed to the Google Groups "julia-stats" group. To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email]. For more options, visit https://groups.google.com/d/optout.
Not currently, but it's been talked about as long as there have been DataFrames in Julia. See these issues, and references therein, for a start:

https://github.com/JuliaStats/DataFrames.jl/issues/25
https://github.com/JuliaStats/DataFrames.jl/issues/26

Also look around in the package and issues list for DataStreams (which I believe is not currently functional), which is a related issue.
On Tue, Aug 5, 2014 at 10:42 AM, Michael Smith <[hidden email]> wrote:
In reply to this post by Michael Smith
Hi,

I don't know Julia, but in R you don't need to load all data into memory; just like SAS, you can read off disk. In R, the proprietary Revolution Analytics R (which I think works with Hortonworks/Cloudera, Hadoop, and YARN, and which I saw demonstrated at the R user group here in Ottawa, Canada), Revolution R's other proprietary methods, and bigmemory (http://cran.r-project.org/web/packages/bigmemory/index.html and http://www.bigmemory.org/) can all handle more data. I don't know if there is a Julia package for YARN. I know little of Hadoop and YARN (and am not really interested in Java), so I suggest you contact someone at Hortonworks or Revolution R.

Here is a discussion on large size data:

https://groups.google.com/forum/#!topic/julia-stats/eqYT85_vUlg

Regards,

Ramesh

On Tue, Aug 5, 2014 at 10:42 AM, Michael Smith <[hidden email]> wrote:
In reply to this post by Harlan Harris
At some point, we need to create some additional data tools for working with data sets that do not fit in memory. Harlan’s list touches on a lot of the best strategies for doing that in a way that would smoothly integrate with the rest of the language.
— John
On Aug 5, 2014, at 7:48 AM, Harlan Harris <[hidden email]> wrote:
Thanks everybody. It should come as no surprise to me that the Julia community is already working on this. Awesome.

One minor point that I have not seen discussed in the issues is a reference to the plyrmr package, which is essentially plyr/dplyr for Hadoop.

https://github.com/RevolutionAnalytics/RHadoop/wiki/user%3Eplyrmr%3EHome

Maybe it's possible to pillage some ideas from there.

M

On 08/06/2014 12:23 PM, John Myles White wrote:
Isn't Hive already "plyr for Hadoop"?
-- John
On Aug 6, 2014, at 6:29 AM, Michael Smith <[hidden email]> wrote:
Yes, and FWIW, I use Julia and Hive all the time via ODBC.jl. I just don't do it within the context of a "dataframe".
On Wednesday, August 6, 2014 10:52:29 AM UTC-4, John Myles White wrote:
Hive is great, but it's SQL-like, whereas plyrmr is R, which is more similar to Julia. Because of this difference, plyrmr might be worth keeping in mind. Also, Antonio Piccolboni (the guy behind much of this) has done an excellent job integrating R with Hadoop (mostly MapReduce), so maybe there's something to be learned.

On a different note, Spark seems to be the new big thing and is going to make much of MapReduce obsolete (based on what I hear from people in the industry), so it might be important to keep in mind that the industry is moving towards Spark (and Tez) and away from MapReduce. (This is not to say that Hadoop will be deprecated, since Spark and Tez integrate well with Hadoop; just that MapReduce, which is part of Hadoop, will mainly be used to keep old projects running, while new development will mainly be done for Spark (and Tez).)

Anyway, this is leading us more to _really_ big data. In contrast, what I had in mind was something like what PyTables does for Python (i.e. sort of _intermediate_ big data, not really big data), but with better integration with DataFrame. I think HDF5 (not HDFS) has already been discussed in one of the issues, so things look fine (although I haven't seen PyTables mentioned explicitly in the GitHub issues mentioned by Harlan, and since PyTables provides an abstraction on top of HDF5, it might also be worth considering as a source of ideas).

Anyway, that's my core dump, hope it helps.

Cheers,

M

On 08/07/2014 01:00 AM, Randy Zwitch wrote:
In reply to this post by Michael Smith
What about something like Continuum's Blaze, where there is a single table and array interface to a variety of data formats and computational engines?

Calculations are separated from the data, so backends can be swapped out and driven by a common syntax.

http://blaze.pydata.org/docs/latest/index.html
If somebody wants to build that, it would be awesome.
-- John
On Aug 13, 2014, at 2:42 PM, Ariel Katz <[hidden email]> wrote:
In reply to this post by Michael Smith
What about something like Continuum's Blaze, where there is a single table and array interface to a variety of data formats and computational engines?

Calculations are separated from the data, so backends can be swapped out and driven by a common syntax. Spark is one of the compatible backends.
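
To make the "calculations separated from the data" idea concrete, here is a minimal sketch in plain Python. It is not Blaze's API: `MeanWhere`, `run_on_rows`, and `run_on_sqlite` are hypothetical names invented for this illustration, and the standard-library `sqlite3` module merely stands in for an external engine. The same description of a computation is handed to two different backends:

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class MeanWhere:
    """Backend-neutral description of: mean(column) where filter_column > threshold.

    Hypothetical type for this sketch; it only describes the computation.
    """
    column: str
    filter_column: str
    threshold: float


def run_on_rows(expr, rows):
    """In-memory backend: evaluate the expression over a list of dicts."""
    selected = [r[expr.column] for r in rows if r[expr.filter_column] > expr.threshold]
    return sum(selected) / len(selected) if selected else None


def run_on_sqlite(expr, conn, table):
    """Database backend: translate the same expression to SQL and let SQLite run it."""
    sql = (
        f'SELECT AVG("{expr.column}") FROM "{table}" '
        f'WHERE "{expr.filter_column}" > ?'
    )
    return conn.execute(sql, (expr.threshold,)).fetchone()[0]


# The same small data set, once as Python objects and once as a SQLite table.
rows = [{"x": 1.0, "y": 10.0}, {"x": 2.0, "y": 20.0}, {"x": 3.0, "y": 30.0}]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x REAL, y REAL)")
conn.executemany("INSERT INTO t VALUES (:x, :y)", rows)

expr = MeanWhere(column="y", filter_column="x", threshold=1.5)
print(run_on_rows(expr, rows), run_on_sqlite(expr, conn, "t"))
```

The user-facing part is only `expr`; whether the data fits in memory becomes a property of the backend rather than of the analysis code, which is roughly the property asked for at the start of this thread.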
In reply to this post by John Myles White
Considering the resources and time invested in Blaze so far, that would certainly be a significant undertaking.

They already provided Julia bindings for Bokeh; I wonder if they would consider doing the same for Blaze...

On Wednesday, August 13, 2014 5:42:45 PM UTC-4, John Myles White wrote:
I don't think Continuum has anything to port with the Julia bindings for Bokeh. I believe that project's largely the work of a single volunteer: https://github.com/samuelcolvin/Bokeh.jl/graphs/contributors
-- John
On Aug 13, 2014, at 2:53 PM, Ariel Katz <[hidden email]> wrote:
Ah ok.
On Wednesday, August 13, 2014 5:56:36 PM UTC-4, John Myles White wrote:
In reply to this post by Michael Smith
I agree, we need a platform able to work transparently with data of any size, at least bigger than memory, and ideally distributed over several computers.

Solutions such as Spark are not complete; they only offer the basic functionality to build something else. We don't just need to be able to get some summaries, as we do with databases; we need to be able to do all operations with big data, such as multiplying two big matrices, fitting mixed-effects models, MCMC, etc. bigmemory or ff allow you to do some simple things, but they cannot be used by other packages like lme4.
In reply to this post by Ramesh Fernando
Yes, but you can only do simple things such as summaries, or use functions implemented in those special packages. So far you can do linear regression, but you can't do more complex things such as mixed-effects regression, or use Stan or any other generic Bayesian package.

The same goes for Spark: you can only use predefined functions, very simple ones, or write your own by hand, but it would be very difficult to program something like lme4 from scratch.

On Tuesday, August 5, 2014 at 5:04:38 PM UTC+2, Ramesh Fernando wrote:
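
As an aside on why linear regression is the case that already works on larger-than-memory data: a least-squares line depends on the data only through a handful of running sums, so one pass over a stream is enough. The sketch below is a plain-Python illustration under that assumption, not code from bigmemory, ff, or Spark; `fit_line_streaming` is a hypothetical helper name.

```python
def fit_line_streaming(pairs):
    """Least-squares fit y = intercept + slope * x from one pass over (x, y) pairs.

    Only a few running numbers are kept, so `pairs` can be any lazy iterator,
    for example rows read from a file that does not fit in memory.
    """
    n = 0
    mean_x = mean_y = 0.0
    sxx = sxy = 0.0  # running sums of centred squares and cross-products
    for x, y in pairs:
        n += 1
        dx = x - mean_x              # deviation from the previous mean of x
        mean_x += dx / n
        mean_y += (y - mean_y) / n
        sxx += dx * (x - mean_x)     # uses the updated mean of x
        sxy += dx * (y - mean_y)     # uses the updated mean of y
    if n < 2 or sxx == 0.0:
        raise ValueError("need at least two distinct x values")
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# A generator stands in for a data source that is never materialised as a list.
pairs = ((float(x), 2.0 + 3.0 * x) for x in range(1000))
intercept, slope = fit_line_streaming(pairs)
print(intercept, slope)
```

Models such as mixed-effects regression are typically fitted by iterative optimisation that revisits the data many times, which is one reason they do not reduce to a single pass like this.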
We're not completely there yet, but with Query.jl and StructuredQueries.jl, combined with JuliaDB/JuliaData packages, one should be able to work on out-of-memory data sets as efficiently as (or more efficiently than) e.g. SAS. The high-level API is the same whether you work on a DataFrame or on an external database.

There's also OnlineStats.jl for computing statistics without loading the full data set in memory at once.

Regards

On Wednesday, September 28, 2016 at 15:48 -0700, Juan wrote:
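
For readers unfamiliar with the idea of computing statistics without loading the full data set at once, here is a minimal sketch of a single-pass mean and variance using Welford's algorithm. It is written in plain Python as an illustration of the general technique only; it is not OnlineStats.jl's API, and `RunningStats` and `stream_of_values` are hypothetical names.

```python
class RunningStats:
    """Single-pass mean and variance (Welford's algorithm), constant memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); NaN with fewer than two values."""
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")


def stream_of_values():
    """Stand-in for values read lazily from disk or a database."""
    for i in range(1, 100_001):
        yield float(i % 10)


stats = RunningStats()
for value in stream_of_values():
    stats.update(value)
print(stats.n, stats.mean, stats.variance)
```

Each observation is seen once and then discarded, so memory use does not grow with the size of the data.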
What about using a tuple of distributed vectors/arrays as a table subclass, or using Dagger for an out-of-core lazy array?

Then it can be loaded into a distributed array for linear algebra.

On Thursday, September 29, 2016 at 4:33:21 AM UTC-4, Milan Bouchet-Valat wrote:
In reply to this post by Milan Bouchet-Valat
Yes, at least in theory it should be possible to e.g. load a very large CSV file with CSV.jl, transform it with Query.jl and then feed it into OnlineStats.jl. I think the architecture of all three packages should be such that this could work with a dataset that is larger than memory. In practice I don't think anyone has tried and I'm sure we would run into things that need fixing, but I can't think of some basic design decision in any of these packages that would prevent this kind of thing in principle.
There is a general question of the core interop type for these things. Right now things like regression packages mostly expect a DataFrame. But we could imagine a world where these packages expected a more generic type. I think right now there are a bunch of potential options out there: both DataStreams and Query define their own streaming interfaces for tabular data (in the case of Query it is just a normal Julia iterator that returns NamedTuple elements). DataStreams in addition defines a column-based interface that might be much faster when the dataset actually fits into memory (pure speculation on my end). I think there are also a bunch of attempts out there to define something like an abstract table structure, but I'm not sure to what extent they would enable a streaming data story.

> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]] On Behalf Of Milan Bouchet-Valat
> Sent: Thursday, September 29, 2016 1:33 AM
> To: [hidden email]
> Subject: Re: [julia-stats] DataFrame and Memory Limitations
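
The pipeline described above (read a large CSV lazily, transform it as a stream of named-tuple rows, feed the result into an online statistic) can be sketched in a few lines. This is a plain-Python analogy under stated assumptions, not the CSV.jl, Query.jl, or OnlineStats.jl APIs; `Row`, `read_rows`, `only_group`, and `running_mean` are hypothetical names, and a small in-memory string stands in for the large file.

```python
import csv
import io
from collections import namedtuple

# Row type for the stream, analogous to iterating over named tuples.
Row = namedtuple("Row", ["group", "value"])


def read_rows(handle):
    """Yield one typed row at a time; the file is never fully in memory."""
    for record in csv.DictReader(handle):
        yield Row(record["group"], float(record["value"]))


def only_group(rows, group):
    """Lazy filter step, analogous to a query over the row stream."""
    return (row for row in rows if row.group == group)


def running_mean(values):
    """Consume a stream and return its mean using constant memory."""
    n, mean = 0, 0.0
    for v in values:
        n += 1
        mean += (v - mean) / n
    return mean


# Small in-memory stand-in for a large CSV file on disk.
sample = io.StringIO("group,value\na,1.0\nb,10.0\na,3.0\nb,30.0\n")
mean_a = running_mean(row.value for row in only_group(read_rows(sample), "a"))
print(mean_a)
```

Because every stage is an iterator, only one row is alive at a time. A consumer written against "any iterator of rows" rather than a concrete DataFrame is one possible reading of the more generic interop type discussed above.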