Feather - a fast on-disk format for data frames

Douglas Bates
Wes McKinney and Hadley Wickham have jointly developed an on-disk format for storing and retrieving data frames from pandas in Python and from R.  See http://blog.rstudio.org/

Sounds like something we should consider soon.  I haven't looked at the code yet but plan to do so soon.

--
You received this message because you are subscribed to the Google Groups "julia-stats" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.

Re: Feather - a fast on-disk format for data frames

Shashi Gowda

Tanmay is working on wrapping a similar format called Parquet (https://github.com/tanmaykm/Parquet.jl). It's a bit more sophisticated than Feather.

I wonder why they don't time the writes in the blog post.



Re: Feather - a fast on-disk format for data frames

jock.lawrie
I'm unclear what this provides that, say, SQLite doesn't. Thoughts?



Re: Feather - a fast on-disk format for data frames

Douglas Bates
The Arrow format, and hence the Feather format, is columnar.

Also, a big selling point for this format is that it can be used from Python/pandas and from R.  A Julia package to read and write this format would be very useful for data exchange.
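
To make the columnar point concrete, here is a minimal sketch assuming a recent Julia (1.x). The names `rows` and `cols` are hypothetical and the values are only illustrative. It contrasts a row-oriented layout, roughly what a row store such as SQLite keeps, with a column-oriented one in which each column is a single contiguous vector, the layout data frames generally use.

```julia
# Row-oriented: one record per element, with the fields of a row stored together.
rows = [(Time = 1.0, demand = 8.3),
        (Time = 2.0, demand = 10.3),
        (Time = 3.0, demand = 19.0)]

# Column-oriented: one contiguous vector per column (hypothetical layout).
cols = (Time = [1.0, 2.0, 3.0], demand = [8.3, 10.3, 19.0])

# Summing one column has to visit every record in the row layout...
total_rows = sum(r.demand for r in rows)

# ...but only reads a single vector in the columnar layout.
total_cols = sum(cols.demand)
```

Both totals agree; the difference is only in how much unrelated data has to be walked past to compute them.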



Re: Feather - a fast on-disk format for data frames

Douglas Bates
I have been playing with the feather format in Julia for a few days without great success.  At present the implementation is a C++ library which is somewhat beyond my ability to grok.  I could offer some choice comments on C++ here but I think I will just go back to programming in Julia.  My rudimentary efforts are in https://github.com/JuliaStats/Feather.jl.

To go any further I think I would need to decide either to go full-bore into C++ coding, generating Julia objects within my compiled code, which is feasible but doesn't sound like a whole lot of fun, or to figure out how to parse the metadata without going through the Flatbuffers-generated C++ code.


Re: Feather - a fast on-disk format for data frames

Cedric St-Jean-2
> figure out how to parse the metadata without going through the Flatbuffers-generated C++ code.

By that, you mean a pure-Julia solution for reading the files?


Re: Feather - a fast on-disk format for data frames

Douglas Bates
Wes McKinney added a C API to libfeather and I was able to use that in the Feather.jl package that is under development.


Re: Feather - a fast on-disk format for data frames

Douglas Bates
Eventually I said "to hell with it" and wrote a Julia package for reading binary files created according to a flatbuffers IDL file, which is how the metadata in a feather file is stored.  The current Feather.Reader is working, more or less.  It will need to be polished and documented but I am very happy with having a native Julia implementation all the way down.


Re: Feather - a fast on-disk format for data frames

Stefan Karpinski
I'll have to try this out this week. Impressive work (as usual).


Re: Feather - a fast on-disk format for data frames

Rob J. Goedman
Hi Doug,

A somewhat related discussion is taking place on the stan-dev list so I have been following your work on Flatbuffers.jl and Feather.jl a bit. 

The previous version (where I had to provide libfeather.dylib) worked fine; the flatbuffers version currently seems to read the metadata only. Is that correct, or do I need more steps?

Regards,
Rob


julia> using Feather

julia> rr = Reader(Pkg.dir("Feather", "test", "data", "iris.feather"))
[150 × 5] @ /Users/rob/.julia/v0.4/Feather/test/data/iris.feather


julia> rr
[150 × 5] @ /Users/rob/.julia/v0.4/Feather/test/data/iris.feather



Re: Feather - a fast on-disk format for data frames

Douglas Bates
Andreas Noack and I have been plugging away on a version that uses Keno's Cxx package, which means it can only be used with Julia 0.5-

See the dmb/cxx branch of the repository, which was forked from the anj/cxx branch today.

One remarkable aspect of this is that it doesn't use the feather C++ library; it only uses the header files flatbuffers.h and metadata_generated.h, the latter of which is generated by the flatc compiler from metadata.fbs.

I think this version handles the missing data correctly but that hasn't been extensively tested.  I can detect when a category is stored but have not yet parsed the category metadata itself.


Re: Feather - a fast on-disk format for data frames

Rob J. Goedman
Thanks Doug,

Sounds promising. I’ll hold off for a while to see if the cxx stuff settles.

Until now I have been trying to avoid going the C++ route to interface with Stan, and I was kind of hoping flatbuffers (for in-memory communication) and feather (slower but, if needed, permanent disk storage) would help out here.

Regards,
Rob


Re: Feather - a fast on-disk format for data frames

Douglas Bates
Okay, so we have a feather file reader for Julia 0.5- using the Cxx package.

$ julia5
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.5.0-dev+3782 (2016-04-28 12:43 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit e8601e8 (0 days old master)
|__/                   |  x86_64-linux-gnu

julia> Pkg.clone("https://github.com/JuliaStats/Feather.jl")
INFO: Cloning Feather from https://github.com/JuliaStats/Feather.jl
INFO: Computing changes...
INFO: Installing FlatBuffers v0.0.1

julia> Pkg.checkout("Feather", "dmb/cxx")
INFO: Checking out Feather dmb/cxx...
INFO: Pulling Feather latest dmb/cxx...
INFO: Removing FlatBuffers v0.0.1

julia> using Feather
WARNING: cfunction: process_cxx_exception does not returnWARNING: New definition 
    size(DataFrames.ModelMatrix, Any...) at /home/bates/.julia/v0.5/DataFrames/src/statsmodels/formula.jl:48
is ambiguous with: 
    size(Any, Integer, Integer, Integer...) at abstractarray.jl:23.
To fix, define 
    size(DataFrames.ModelMatrix, Integer, Integer, Integer...)
before the new definition.

julia> rr = Feather.Reader(Pkg.dir("Feather", "test", "data", "BOD.feather"))
[6 × 2] @ /home/bates/.julia/v0.5/Feather/test/data/BOD.feather
 Time    : Float64
 demand  : Float64


julia> BOD = DataFrame(rr)
6x2 DataFrames.DataFrame
│ Row │ Time │ demand │
┝━━━━━┿━━━━━━┿━━━━━━━━┥
│ 1   │ 1.0  │ 8.3    │
│ 2   │ 2.0  │ 10.3   │
│ 3   │ 3.0  │ 19.0   │
│ 4   │ 4.0  │ 16.0   │
│ 5   │ 5.0  │ 15.6   │
│ 6   │ 7.0  │ 19.8   │

As mentioned in the README, in the interests of speed the column contents are memory-mapped arrays pointing to the contents of the file on disk.  This means you can read big arrays very quickly.  However, if the Feather.Reader object is garbage collected you lose the contents of the columns.  Use deepcopy if you want to be safe rather than fast.
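
The fast-versus-safe trade-off can be sketched with Julia's standard `Mmap` library. This assumes a recent Julia and is not Feather.jl's own code; the temporary file and the names `mapped` and `safe` are hypothetical stand-ins for a column stored on disk. A memory-mapped vector is a view onto the file's bytes, whereas a copy is an ordinary in-memory array that no longer depends on the mapping.

```julia
using Mmap

# Write six Float64 values to a temporary file (a stand-in for one column on disk).
path, io = mktemp()
write(io, [8.3, 10.3, 19.0, 16.0, 15.6, 19.8])
close(io)

# Fast: map the file contents as a vector instead of reading them eagerly.
mapped = open(path) do f
    Mmap.mmap(f, Vector{Float64}, 6)
end

# Safe: an independent copy that lives in ordinary Julia memory.
safe = copy(mapped)
```

For a single numeric vector `copy` is enough; for a whole data frame, `deepcopy` as suggested above plays the same role.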
