specifying columns

Stefan Karpinski (Aug 26, 2014)
Here's a thought about making expressions like `foo ~ bar + baz` work, where "foo", "bar" and "baz" are names of columns that are expected to be in something like a data frame. The main issue here is that the meaning of this code isn't locally determined if names are populated from the columns of data frames, i.e. adding a column named "sin" could change the meaning of code that uses "sin" assuming that it's the Base.sin function. What if you just declare columns in the current scope and then they're just normal variables? As in

function frizz(df::DataFrame)
  @columns foo bar baz
  fit(model, foo ~ bar + baz + 1)
end

You could do the same thing in the REPL just by writing `@columns foo bar baz` once. Those names would then be globals, and any expression that needs to know which names should be mapped to columns would see that those are the ones.

One issue is evaluation time – when do you evaluate formula expressions? If you want to expand the formula to an expression that has the data frame access in it at macro expansion time, then the @columns macro has to register those names at the same time. That may be ok, actually, since both are evaluated in the same "phase". But it would be a little cleaner if @columns foo bar baz was just shorthand for something simple like:

foo = Column(:foo)
bar = Column(:bar)
baz = Column(:baz)

That way normal scoping rules would determine the meaning of names, which is desirable. It's a slightly fuzzy idea, but it's the best one I've come up with for this ongoing problem.
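
To make the shorthand above concrete, here is a minimal sketch of what such a macro might look like. Both the `Column` type and the `@columns` macro are hypothetical names taken from the proposal, not an existing API, and the `declared_names` function is only there to show the scoping behaviour.

```julia
# Hypothetical marker type: a value that names a data frame column.
struct Column
    name::Symbol
end

# Hypothetical macro: `@columns foo bar` expands to
#     foo = Column(:foo); bar = Column(:bar)
# in the caller's scope, so ordinary scoping rules decide what `foo` means.
macro columns(names...)
    assignments = [:($(esc(n)) = Column($(QuoteNode(n)))) for n in names]
    return Expr(:block, assignments...)
end

# Names are only treated as columns where they were declared.
function declared_names()
    @columns foo bar
    return (foo, bar, sin)   # `sin` was not declared, so it is still Base.sin
end
```

With this expansion, a name like `sin` keeps its usual meaning unless it is explicitly listed in `@columns`, which is the local determinism the proposal is after.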

--
You received this message because you are subscribed to the Google Groups "julia-stats" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.

Re: specifying columns

John Myles White
Hi Stefan,

Have you looked at DataFramesMeta? https://github.com/JuliaStats/DataFramesMeta.jl

I think it’s the best approach to what I’ve come to think of as the “joint namespace” problem, in which you want access to the current namespace along with a pseudo-namespace defined by the columns of a DataFrame.

In the context of formulas, I think it’s better to just insist unequivocally that the columns referenced in a formula must be columns of the source DataFrame.

 — John

Re: specifying columns

Stefan Karpinski
Using symbols to indicate that something comes from a data frame feels off to me. What do you do if you need to write code that actually uses a literal symbol? I guess you just can't / don't do that in these contexts?


Re: specifying columns

John Myles White
Use symbol("a") as in @select(df, :a == symbol("a"))?

 — John

Re: specifying columns

Stefan Karpinski
Hmm. I suppose that works.


Re: specifying columns

John Myles White
Following up on this: I really think the use of symbols as done in DataFramesMeta is the best general solution, since it solves the two-scope problem in which you might want access to both the column named x and a local variable named x.

 — John
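
The symbol convention discussed in this thread can be illustrated with a toy macro. This is only a sketch under stated assumptions: `rewrite` and `@with_columns` are hypothetical names, a plain `Dict` stands in for a data frame, and none of this is the DataFramesMeta API. The idea is that a quoted symbol such as `:x` is rewritten into a column lookup, a bare `x` remains an ordinary variable, and a literal symbol is still available through a constructor call (written `symbol("a")` in the thread, `Symbol("a")` in current Julia).

```julia
# Hypothetical helper: rewrite each quoted symbol literal `:name` in an
# expression into the lookup `cols[:name]`; everything else is left alone.
rewrite(x, cols) = x
rewrite(q::QuoteNode, cols) = q.value isa Symbol ? Expr(:ref, cols, q) : q
rewrite(e::Expr, cols) = Expr(e.head, map(a -> rewrite(a, cols), e.args)...)

# Hypothetical macro (not the DataFramesMeta API): evaluate `body` with
# `:name` meaning "column `name` of `cols`".
macro with_columns(cols, body)
    return esc(rewrite(body, cols))
end

cols = Dict(:x => [1, 2, 3], :label => [:a, :b, :a])  # stand-in for a data frame
x = 2                                                  # local variable named like a column

above = @with_columns(cols, :x .> x)                 # column x compared with local x
is_a  = @with_columns(cols, :label .== Symbol("a"))  # Symbol("a") stays a literal symbol
```

Because the column reference and the local variable are spelled differently (`:x` versus `x`), both scopes stay reachable in the same expression, and adding a column never changes the meaning of a bare name.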
