PooledDataArray variant?

PooledDataArray variant?

Jeff Bezanson
Hi all,

I recently ran into a use case for PooledDataArrays, avoiding storing
a huge number of copies of the same string. This is purely for
compression, and works beautifully for that. In my case it's also nice
to be able to use Int8 and Int16, so I'm taking advantage of that.
However, I don't want missing values (type instability being the
biggest problem) and I don't need any categorical behavior.
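To make the compression idea concrete, here is a minimal sketch in Python (not Julia, and not the PooledDataArray implementation). It assumes the scheme described above: each distinct value is stored once in a pool, and the array itself holds only small integer codes. The names `pool_encode` and `pool_decode` are hypothetical, and the `array` typecodes `'b'` and `'h'` stand in for Int8 and Int16.

```python
from array import array

def pool_encode(values, typecode="b"):
    """Return (pool, refs): distinct values stored once, plus small integer codes."""
    pool = []               # distinct values, in order of first appearance
    index = {}              # value -> position in pool
    refs = array(typecode)  # 'b' = signed 8-bit codes, 'h' = signed 16-bit codes
    for v in values:
        if v not in index:
            index[v] = len(pool)
            pool.append(v)
        # array.append raises OverflowError if the code does not fit the typecode
        refs.append(index[v])
    return pool, refs

def pool_decode(pool, refs):
    """Rebuild the original sequence from the pool and the codes."""
    return [pool[r] for r in refs]

data = ["red", "green", "red", "red", "blue", "green"] * 1000
pool, refs = pool_encode(data)
```

With an 8-bit code type, the 6000 strings above cost one byte each plus a three-entry pool. The trade-off is that the number of distinct values is capped by the code type.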

Would it make sense to have a version of PooledDataArray for this kind
of application? I'm imagining a PooledArrays.jl package just with this
type. Maybe something like this already exists? If not, I'd be happy
to start the legwork if the idea makes sense.

-Jeff

--
You received this message because you are subscribed to the Google Groups "julia-stats" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.

Re: PooledDataArray variant?

Milan Bouchet-Valat
On Thursday, 26 May 2016 at 12:37 -0400, Jeff Bezanson wrote:

> Hi all,
>
> I recently ran into a use case for PooledDataArrays, avoiding storing
> a huge number of copies of the same string. This is purely for
> compression, and works beautifully for that. In my case it's also nice
> to be able to use Int8 and Int16, so I'm taking advantage of that.
> However, I don't want missing values (type instability being the
> biggest problem) and I don't need any categorical behavior.
>
> Would it make sense to have a version of PooledDataArray for this kind
> of application? I'm imagining a PooledArrays.jl package just with this
> type. Maybe something like this already exists? If not, I'd be happy
> to start the legwork if the idea makes sense.
Actually, I've recently started to work on John Myles White's
CategoricalData.jl package, which is meant to replace PDAs. At the
moment, I have two different types: CategoricalArray and
NullableCategoricalArray, and the former does not support missing
values.

I wanted to polish several aspects and write some docs before making
the code available, but you can have a look at my fork here:
https://github.com/nalimilan/CategoricalData.jl

Comments welcome! The original discussion of the design is here:
https://github.com/JuliaStats/DataArrays.jl/issues/73


Regards

Re: PooledDataArray variant?

Jeff Bezanson
Is this de-coupled from the notion of categorical data? I want
something that just does the pooling optimization automatically for
all types T, without separately defining the pool or adding any new
ordering behavior. It would probably also be good to store the pool
sorted for fast lookup, but that's a bonus.
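As an illustration of the sorted-pool idea, here is a small Python sketch (an assumption about one way it could work, not code from any Julia package). The pool is built automatically from the data and kept sorted, so a value's code can be found by binary search. The helper names `build_sorted_pool` and `find_code` are hypothetical, and the element type is assumed to be orderable and hashable.

```python
from bisect import bisect_left

def build_sorted_pool(values):
    """Derive the pool from the data itself; no separately defined pool is needed."""
    pool = sorted(set(values))                      # distinct values, sorted
    refs = [bisect_left(pool, v) for v in values]   # code = position in sorted pool
    return pool, refs

def find_code(pool, value):
    """Binary search: return the code for value, or None if it is not pooled."""
    i = bisect_left(pool, value)
    if i < len(pool) and pool[i] == value:
        return i
    return None

pool, refs = build_sorted_pool(["b", "a", "c", "a"])
```

One consequence of this layout is that adding a new value in the middle of a sorted pool shifts the codes after it, so existing references would need to be remapped.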

Re: PooledDataArray variant?

Milan Bouchet-Valat
On Thursday, 26 May 2016 at 15:16 -0400, Jeff Bezanson wrote:
> Is this de-coupled from the notion of categorical data? I want
> something that just does the pooling optimization automatically for
> all types T, without separately defining the pool or adding any new
> ordering behavior. It would probably also be good to store the pool
> sorted for fast lookup, but that's a bonus.
It depends on what you mean by "categorical data". CategoricalArray
stores a CategoricalPool, but that's mostly invisible to the user. When
indexed, it returns CategoricalValue objects which are immutable
wrappers storing the value (i.e. the string) and a reference to the
pool. In practice it should be usable as a string in many cases.
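The wrapper design described above can be approximated in Python as follows. This is only a sketch of the concept, not the CategoricalData.jl code: `PoolValue` and `WrappedPooledList` are hypothetical names, and the pool is modeled as a plain tuple.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PoolValue:
    """Immutable wrapper holding the value and a reference to its pool."""
    value: str
    pool: tuple

    def __str__(self):
        return str(self.value)

class WrappedPooledList:
    """Pooled storage whose indexing returns wrappers, not raw values."""
    def __init__(self, values):
        self.pool = tuple(dict.fromkeys(values))   # distinct values, first-seen order
        lookup = {v: i for i, v in enumerate(self.pool)}
        self.refs = [lookup[v] for v in values]

    def __getitem__(self, i):
        return PoolValue(self.pool[self.refs[i]], self.pool)

arr = WrappedPooledList(["a", "b", "a"])
x = arr[0]
```

Here `str(x)` gives back the string, so the wrapper can stand in for it in many places, but `x` is not itself a string, which is the distinction at issue in this thread.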

Then there's OrdinalArray, which adds an ordering to the values, by
default based on the order of appearance of the levels or on their
insertion order.

Does CategoricalArray suit your needs? The main difference from PDAs is
that it doesn't try to act like a standard array that supports every
operation the underlying type supports.


Regards

Re: PooledDataArray variant?

Jeff Bezanson
> it returns CategoricalValue objects

This is the part I don't want. I want it to behave exactly like a
Vector{T}, just space-optimized.
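For contrast with the wrapper approach, here is a minimal Python sketch of a container that behaves like an ordinary sequence and only pools its storage internally. It is an illustration under assumptions, not an existing package: `PooledList` is a hypothetical name, and elements are assumed to be hashable.

```python
from collections.abc import Sequence

class PooledList(Sequence):
    """Sequence that returns plain elements; pooling is a storage detail."""
    def __init__(self, values=()):
        self._pool = []    # distinct values, stored once
        self._index = {}   # value -> code
        self._refs = []    # one code per element
        for v in values:
            self._refs.append(self._code(v))

    def _code(self, v):
        """Return the code for v, adding v to the pool if it is new."""
        code = self._index.get(v)
        if code is None:
            code = self._index[v] = len(self._pool)
            self._pool.append(v)
        return code

    def __len__(self):
        return len(self._refs)

    def __getitem__(self, i):
        if isinstance(i, slice):
            return [self._pool[r] for r in self._refs[i]]
        return self._pool[self._refs[i]]   # the raw value, not a wrapper

    def __setitem__(self, i, v):
        self._refs[i] = self._code(v)      # the pool grows automatically

xs = PooledList(["a", "b", "a"])
xs[1] = "c"
```

Because indexing returns the stored element itself, code written against a regular list of strings works unchanged. In this sketch, values that are no longer referenced simply stay in the pool.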

Re: PooledDataArray variant?

Andreas Noack
Isn't the "categorical" interpretation a property of the collection, and therefore less relevant when you consider an element in isolation? What kind of operations do you have in mind for CategoricalValues?
