How to order categorical variable in dataframe but not alphabetically

Pedro L Vera
Hi,

I'm a new user to Julia and I have been stuck on a problem for a few days.

I have a data frame that has a categorical variable (call it "Group") and numerical variables (e.g. "Score"). 

I have used 

pool!(df2, [:Group])

to create the levels so that I can run

lm1 = fit(LinearModel, Score ~ Group, df2)

The regression works fine and my comparisons across the levels of my categorical variable look great (I checked the output against R and it looks the same).

What I would like to do and have not been able to do is order the categorical variable, but not alphabetically. In other words, if the levels in Group are "A","B","C", I want to order them as "A","C","B".

This will line up my level comparisons in the regression output the way I want and, more importantly, the boxplot will be drawn in the order that I want.

I have read the DataFrames docs and have tried several formulations of "by" or "order", but I always end up with an error. I guess I can't figure out the syntax.

In R, this would be:

Group=factor(df2$Group,levels=c("A","C","B"))

and the order of the levels would be set for all subsequent regressions and graphics.

Is there a similar way to do this in Julia?

Thanks for the help.

Pedro

--
You received this message because you are subscribed to the Google Groups "julia-stats" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.

Re: How to order categorical variable in dataframe but not alphabetically

Milan Bouchet-Valat
On Friday, 11 September 2015 at 13:04 -0700, Pedro L Vera wrote:

> Is there a similar way to do this in Julia?
You can call setlevels!() after the fact. I agree it would make sense
to accept a list of levels as an argument. Could you file an issue on
GitHub against DataFrames.jl?


Regards


Re: How to order categorical variable in dataframe but not alphabetically

Pedro L Vera
Thanks for the reply. I have not been able to find documentation on 'setlevels!'.

I tried:

setlevels!(df2[:Group], ["A","C","B"])

and that returns:

`setlevels!` has no method matching setlevels!(::PooledDataArray{UTF8String,Uint8,1}, ::Array{ASCIIString,1})
while loading In[47], in expression starting on line 1

setlevels(df2[:Group], ["A","C","B"])

does work and returns the order I want on the variable "Group". However, that messed up my next step which is the regression.

Thanks.


Re: How to order categorical variable in dataframe but not alphabetically

Milan Bouchet-Valat
On Friday, 11 September 2015 at 15:44 -0700, Pedro L Vera wrote:
> Thanks for the reply. Have not been able to find documentation on
> 'setlevels!'
Yeah, the docs are quite incomplete at the moment.

> I tried:
>
> setlevels!(df2[:Group], ["A","C","B"])
>
> and that returns:
>
> `setlevels!` has no method matching
> setlevels!(::PooledDataArray{UTF8String,Uint8,1},
> ::Array{ASCIIString,1})
> while loading In[47], in expression starting on line 1
>
> setlevels(df2[:Group], ["A","C","B"])
>
> does work and returns the order I want on the variable "Group".
> However, that messed up my next step which is the regression.
Ah, that's a silly issue because ASCIIString != UTF8String. This should
work:

setlevels!(df2[:Group], UTF8String["A","C","B"])

(The default string type will most likely stop changing depending on
whether there are only ASCII characters or not in a future Julia
release.)
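
The mismatch described above can be reproduced without DataArrays. The sketch below is a hypothetical stand-in, not the actual setlevels! source: replace_pool! is an invented function whose signature requires both vectors to share one element type T, which is assumed to be roughly how the method in the error message was declared. It is written for current Julia, where ASCIIString and UTF8String were merged into a single String type (in Julia 0.5), so SubString{String} plays the role of the mismatched type here.

```julia
# Hypothetical stand-in for a method that only accepts a replacement
# vector whose element type T matches the existing pool exactly.
function replace_pool!(pool::Vector{T}, newlevels::Vector{T}) where {T}
    length(pool) == length(newlevels) || throw(ArgumentError("length mismatch"))
    copyto!(pool, newlevels)
    return pool
end

pool = String["A", "B", "C"]

# split returns a Vector{SubString{String}}, not a Vector{String}.
mismatched = split("A C B")

# Capture the error instead of letting it abort the script.
err = try
    replace_pool!(pool, mismatched)
    nothing
catch e
    e
end
println(typeof(err))   # MethodError: no method accepts the two different element types

# Converting to the pool's element type lets the call dispatch.
replace_pool!(pool, String.(mismatched))
println(pool)
```

Writing UTF8String[...] in the call above plays the same role as String.(...) in this sketch: it forces the literal to have the element type the method expects.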


Regards



Re: How to order categorical variable in dataframe but not alphabetically

Pedro L Vera
Hi Milan,

Thanks for all the help. Yes, that indeed does work!

Yet it reorders that variable only, not the rest of the dataframe. In other words, if my original dataframe showed:

Group      Score
A             1
B             2
C             3

After setlevels!, it shows:

Group      Score
A             1
C             2
B             3

Is there a way to reorder the entire dataframe based on the new levels? Otherwise my regression (and graphs) are incorrect since they will show incorrect labels for the coefficients.

I apologize if I'm asking very newbie questions.

Regards,

Pedro


Re: How to order categorical variable in dataframe but not alphabetically

Milan Bouchet-Valat
On Saturday, 12 September 2015 at 03:59 -0700, Pedro L Vera wrote:

> I apologize if I'm asking very newbie questions.
Don't worry, the problem is rather the lack of documentation. We
really need help in that area.

Actually, setlevels!() isn't what you need here: it merely replaces the
levels with different values, but doesn't allow changing their order
AFAICT.

So it appears you need to call PooledDataArray():

julia> df = DataFrame(x=pool(["A", "B", "C"]), y=[1, 2, 3])
3x2 DataFrames.DataFrame
| Row | x   | y |
|-----|-----|---|
| 1   | "A" | 1 |
| 2   | "B" | 2 |
| 3   | "C" | 3 |

julia> df[:x] = PooledDataArray(df[:x], ["C", "B", "A"])
3-element DataArrays.PooledDataArray{ASCIIString,UInt8,1}:
 "A"
 "B"
 "C"

julia> df
3x2 DataFrames.DataFrame
| Row | x   | y |
|-----|-----|---|
| 1   | "A" | 1 |
| 2   | "B" | 2 |
| 3   | "C" | 3 |

julia> levels(df[:x])
3-element Array{ASCIIString,1}:
 "C"
 "B"
 "A"
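
The difference between renaming levels and reordering them can be sketched with a toy model in Base Julia. This is an assumption about how a pooled array behaves (integer references into a vector of levels), chosen because it matches the output reported in this thread; ToyPooled, values_of, rename_levels and reorder_levels are hypothetical names and are not part of DataArrays.

```julia
# Toy model of a pooled array (hypothetical, Base Julia only):
# each value is stored as an integer index into a vector of levels.
struct ToyPooled
    refs::Vector{Int}      # one index per row
    pool::Vector{String}   # the levels, in level order
end

# Decode the stored indices back into the visible values.
values_of(p::ToyPooled) = p.pool[p.refs]

# Renaming: swap in new labels and leave the indices alone.
# A row that pointed at level 2 now shows whatever label 2 has become.
rename_levels(p::ToyPooled, newpool) =
    ToyPooled(copy(p.refs), collect(String, newpool))

# Reordering: change the level order and remap every index so that
# each row still decodes to the same value as before.
function reorder_levels(p::ToyPooled, newpool)
    index_of = Dict(level => i for (i, level) in enumerate(newpool))
    ToyPooled([index_of[v] for v in values_of(p)], collect(String, newpool))
end

x = ToyPooled([1, 2, 3], ["A", "B", "C"])

renamed = rename_levels(x, ["A", "C", "B"])
println(values_of(renamed))     # the data itself changed to A, C, B

reordered = reorder_levels(x, ["A", "C", "B"])
println(values_of(reordered))   # the data is still A, B, C
println(reordered.pool)         # only the level order is now A, C, B
```

In this model the renamed array reproduces the mislabelled rows seen after setlevels!, while the reordered one keeps each row's value and only changes which level comes first.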


Regards



Re: How to order categorical variable in dataframe but not alphabetically

Pedro L Vera
Hi Milan,

Thanks so much for all the detailed instructions. They really helped and my regression coefficients are now ordered the way I want them to be.

One small thing I added to the code was "UTF8String" since that made it work:

julia> df[:x] = PooledDataArray(df[:x], UTF8String["C", "B", "A"])

Re: How to order categorical variable in dataframe but not alphabetically

Pedro L Vera
Oops, hit post too quickly.

The regression runs great now with the desired order.

Oddly, though, the order does not carry over to my plot. I'm using Gadfly, and if I run:

p1= plot(df2, x="Group", y="Score", Geom.boxplot,
Guide.ylabel("Scores"), # label for y-axis
Guide.title("Scores by Groups"))

the groups are still ordered alphabetically. I also tried:

p2 = plot(df2, x=PooledDataArray(df2[:Group], UTF8String["A","B","C"]), y="Score", ...)

While that did indeed reorder the groups in the boxplot, the plots were incorrect (did not match the ranges shown in p1).

So, the ordering does not carry over to Gadfly.

Thanks.

Pedro


Re: How to order categorical variable in dataframe but not alphabetically

Milan Bouchet-Valat
On Sunday, 13 September 2015 at 03:14 -0700, Pedro L Vera wrote:

> So, the ordering does not carry over to gadfly.
Indeed, that's probably because Gadfly calls unique(), which returns
the levels in their order of appearance in the data. I couldn't find
where this happens in the code, but please file an issue on GitHub
against Gadfly.jl; this definitely needs fixing.
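
The order-of-appearance behaviour of unique() is easy to check in Base Julia, and it suggests a possible workaround: sort the rows by level rank before plotting. The sketch below uses made-up data in plain vectors standing in for the Group and Score columns. Whether Gadfly then draws the boxes in that order rests on the unconfirmed guess above that it relies on unique(), so treat this as an assumption to test rather than a fix.

```julia
# Plain vectors standing in for the Group and Score columns (made-up data).
group = ["B", "A", "C", "A", "B", "C"]
score = [2.0, 1.0, 3.0, 1.5, 2.5, 3.5]

# unique() keeps first-appearance order; it knows nothing about level order.
println(unique(group))

# Desired level order from this thread.
wanted = ["A", "C", "B"]
rank = Dict(level => i for (i, level) in enumerate(wanted))

# Sort the row indices by each row's level rank, then apply the same
# permutation to every column so that rows stay paired.
perm = sortperm([rank[g] for g in group])
group_sorted = group[perm]
score_sorted = score[perm]

# After sorting, first-appearance order coincides with the wanted order.
println(unique(group_sorted))
```

Applying one permutation to all columns is what keeps each score attached to its own group, which avoids the mismatched ranges seen in p2.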


Regards


Re: How to order categorical variable in dataframe but not alphabetically

Pedro L Vera
Posted the issue already. Thanks again for all your help.

Best,

