In a discussion on julia-users about the formula language for the MixedModels package I found myself again explaining the implicit intercept in the R formula language and its ramifications. This led me to consider not having an implicit intercept and requiring the user to write y ~ 1 + x when they want an intercept and slope.
It is two extra characters to type which is negligible. However, it would throw off any R users who expect the implicit intercept. Which is the lesser of the two evils? I personally think that requiring an explicit intercept term would make the connection between the formula and the fitted coefficients much clearer but that doesn't mean that there wouldn't be howls of protest.

You received this message because you are subscribed to the Google Groups "julia-stats" group. To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email]. For more options, visit https://groups.google.com/d/optout.
On Friday, 22 August 2014 at 13:11 -0700, Douglas Bates wrote:
> In a discussion on julia-users about the formula language for the MixedModels package I found myself again explaining the implicit intercept in the R formula language and its ramifications. This led me to consider not having an implicit intercept and requiring the user to write y ~ 1 + x when they want an intercept and slope.
>
> It is two extra characters to type which is negligible. However, it would throw off any R users who expect the implicit intercept. Which is the lesser of the two evils? I personally think that requiring an explicit intercept term would make the connection between the formula and the fitted coefficients much clearer but that doesn't mean that there wouldn't be howls of protest.

That's an interesting idea, but I'm concerned that it would confuse not only R users, but also users of most statistical frameworks I know of (MATLAB, Python's statsmodels, SAS, Stata). People will inevitably forget to add the intercept because they take inspiration from existing material using other software, and we'll end up answering this question much more often than we currently answer the reverse one. I'd say that in 90% of cases you want the intercept to be included.

It would be useful, though, to print the formula of the model in its expanded form, i.e. adding `1 +` when not specified.

My two cents
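As an illustration of the "print the expanded form" suggestion, here is a minimal Python sketch. It is not code from MixedModels or any Julia package; the function names `split_formula` and `expand_formula` are hypothetical, and the sketch assumes the R-style rule (intercept included unless a `0` term appears) and handles only terms separated by `+`.

```python
def split_formula(formula):
    """Split a string such as 'y ~ 1 + x' into ('y', ['1', 'x'])."""
    lhs, rhs = formula.split("~", 1)
    terms = [t.strip() for t in rhs.split("+") if t.strip()]
    return lhs.strip(), terms


def expand_formula(formula):
    """Return the formula with the intercept decision written out.

    Assumed rule (R-style, for illustration only): an intercept is
    included unless the right-hand side contains the term 0.
    """
    lhs, terms = split_formula(formula)
    if "0" in terms:
        # The user asked for no intercept: keep the 0 marker up front.
        rest = [t for t in terms if t != "0"]
        return f"{lhs} ~ " + " + ".join(["0"] + rest)
    # Otherwise the intercept is present, whether it was typed or not.
    rest = [t for t in terms if t != "1"]
    return f"{lhs} ~ " + " + ".join(["1"] + rest)


if __name__ == "__main__":
    for f in ["y ~ x", "y ~ x + 1", "y ~ x + 0"]:
        print(f, "->", expand_formula(f))
```

Under these assumptions `y ~ x` and `y ~ x + 1` both print as `y ~ 1 + x`, so the fitted model always shows which columns it actually used.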
In reply to this post by Douglas Bates
+1 (get it) But seriously, I would love this. Of course, I never learned the formula language in R and it was just this weird thing that would show up in code snippets that I never actually understood. Having the explicit `y ~ x + 1` makes it so much clearer to me as a person with a mathematical background who never specifically learned this syntax. Since Julia's formula syntax is deviating from R's already, it might be better for it to just be "its own thing" instead of a slightly "broken" version of R's formulas.
On Fri, Aug 22, 2014 at 4:11 PM, Douglas Bates <[hidden email]> wrote:
I also like making the intercept explicit with a + 1.
On a related but separate note, I'd also prefer that we fit the intercept without storing a column of 1's in the design matrix. This is how lots of machine learning systems like Vowpal Wabbit work, because it makes it easier to (a) work with an existing matrix of features without mutating it and (b) perform regularization of all terms except the intercept by doing things like cost += norm(coef, 1) or cost += norm(coef, 2).

Stefan, if you've never learned the formula syntax (especially the lme4 variant), I'd encourage you to try it out. I'd argue that it's one of the best features of R.

-- John

On Aug 22, 2014, at 2:12 PM, Stefan Karpinski <[hidden email]> wrote:
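To make the "intercept without a column of 1's" idea concrete, here is a small Python sketch for the single-predictor case. It is an illustration under stated assumptions, not how Vowpal Wabbit or any Julia package does it: the function name `fit_penalized_line` is hypothetical, and it uses a squared L2 penalty on the slope (chosen because it has a closed form) rather than the plain norms mentioned above. The intercept is recovered from the means, so it is never penalized and the feature data is never modified.

```python
def fit_penalized_line(x, y, lam=0.0):
    """Fit y = intercept + slope * x, penalizing only the slope.

    Minimizes sum((y - intercept - slope * x)^2) + lam * slope^2.
    No column of ones is built: centering x and y removes the
    intercept from the problem, and it is recovered afterwards.
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / (sxx + lam)              # shrunk towards 0 when lam > 0
    intercept = y_mean - slope * x_mean    # not penalized
    return intercept, slope


if __name__ == "__main__":
    x = [0.0, 1.0, 2.0, 3.0]
    y = [1.0, 3.0, 5.0, 7.0]
    print(fit_penalized_line(x, y))           # ordinary least squares
    print(fit_penalized_line(x, y, lam=5.0))  # slope shrunk, intercept adjusts
```

With `lam=0.0` this reduces to ordinary least squares; with a positive `lam` only the slope shrinks, which is the behaviour point (b) asks for.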
On Fri, Aug 22, 2014 at 5:18 PM, John Myles White <[hidden email]> wrote:

> Stefan, if you've never learned the formula syntax (especially the lme4 variant), I'd encourage you to try it out. I'd argue that it's one of the best features of R.

Will do :-)
In reply to this post by Douglas Bates
I myself have at times wished the formula interface in R would require an explicit intercept. It would be easier for students to map the terms to the terms of a linear formula they're already familiar with.
On Friday, August 22, 2014 11:11:50 PM UTC+3, Douglas Bates wrote:
In reply to this post by Douglas Bates
On Friday, August 22, 2014 3:11:50 PM UTC-5, Douglas Bates wrote:
As a user and teacher, I'd prefer the implicit intercept. The most common case should usually be the default, and regression with an intercept is overwhelmingly the common case. Users can already write "y ~ 1 + x" if they want the intercept to be made explicit, and it gives the same results as "y ~ x".

I agree with John's suggestion that the intercept should not actually be stored as a column of 1s in the design matrix, though.

--Gray
It is generally a good principle to make the default the right thing. But in this case it makes specifying that there is no intercept so bizarre – writing `y ~ x + 0` and having that be different from `y ~ x`, but having `y ~ x + 1` be the same as `y ~ x`, is just strange.
On Fri, Aug 22, 2014 at 5:49 PM, Gray Calhoun <[hidden email]> wrote:
In reply to this post by Johan Sigfrids
On Friday, 22 August 2014 at 14:30 -0700, Johan Sigfrids wrote:
> I myself have at times wished the formula interface in R would require
> explicit intercept. It would be easier for students to map the terms
> to the terms of a linear formula they're already familiar with.

So if most people are in favor of requiring an explicit `+ 1`, what do you think of also requiring the user to specify that the intercept should not be included, using `- 1` or `+ 0`? That way, an error could be printed when nothing is specified, which would catch the most common mistake of people coming from all other languages.

Regards
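The "error when nothing is specified" proposal can be sketched as a small validator. This is a hypothetical Python illustration, not an existing API: the name `intercept_requested` and the exact error messages are assumptions, and the parsing is deliberately naive (terms separated by `+`, with `- 1` accepted as a synonym for `+ 0`).

```python
def intercept_requested(formula):
    """Return True for an explicit intercept, False for an explicit
    'no intercept', and raise ValueError if the formula says nothing."""
    _, rhs = formula.split("~", 1)
    # Turn "x - 1" into "x + -1" so that splitting on "+" is enough.
    tokens = rhs.replace("-", "+ -").split("+")
    terms = [t.replace(" ", "") for t in tokens if t.strip()]
    wants_intercept = "1" in terms
    drops_intercept = "0" in terms or "-1" in terms
    if wants_intercept and drops_intercept:
        raise ValueError("formula both includes and excludes the intercept")
    if not (wants_intercept or drops_intercept):
        raise ValueError(
            "no intercept decision: write '+ 1' to include it "
            "or '+ 0' / '- 1' to exclude it"
        )
    return wants_intercept


if __name__ == "__main__":
    print(intercept_requested("y ~ x + 1"))  # True
    print(intercept_requested("y ~ x - 1"))  # False
    try:
        intercept_requested("y ~ x")
    except ValueError as err:
        print("error:", err)
```

A formula copied from R-style material, such as `y ~ x`, fails loudly here instead of silently fitting a model through the origin.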
I believe not specifying anything would count as no intercept.
-- John

On Aug 22, 2014, at 2:58 PM, Milan Bouchet-Valat <[hidden email]> wrote:
In reply to this post by Stefan Karpinski
On Friday, August 22, 2014 4:54:48 PM UTC-5, Stefan Karpinski wrote:
R allows `y ~ x - 1` for "without intercept" which has always seemed reasonable to me. I agree that `y ~ x + 0` being different from `y ~ x` is strange.

FWIW I've never heard someone say, "I ran a regression of salary on education level and an intercept." But people will say it out loud if there's no intercept. Regression without an intercept is something that almost never comes up. (I know I'm repeating myself...)
In reply to this post by John Myles White
On 22 August 2014 15:07, John Myles White <[hidden email]> wrote:

> I believe not specifying anything would count as no intercept.

That seems very dangerous to me. If non-experts end up using it (and maybe that's not a good thing, but it will almost certainly happen) they're going to screw it up, and present nonsense results without knowing it.

Perhaps if you don't specify anything about the intercept, an error or warning should be given. But how often do you run regressions without the intercept? For me, it's extremely close to never. (Most commonly, it's to demonstrate how you get screwy results if you do it.)

Jeremy
I prefer to use an implicit 1. It is simple and most people are going to expect it. But it would also be great to have the ability to use any numeric constant to indicate a factor in the intercept term. You can choose 0 to disallow the intercept term. Also you can do some fancy stuff, like allowing to find factors of pi or other constants ;)

y ~ x
y = a.x + 1.b     y ~ x + 1
y = a.x - 1.b     y ~ x - 1
y = a.x + 0.b     y ~ x + 0
y = a.x + pi.b    y ~ x + pi

On Friday, 22 August 2014 19:19:39 UTC-3, Jeremy Miles wrote:
Hi, Best, Bradley On Friday, August 22, 2014 5:42:34 PM UTC-5, Diego Javier Zea wrote:
I’m surprised that many of the e-mails on this thread are so focused on what kinds of models are commonly fit, rather than considering what are, in my mind, the core issues of programming language design: simplicity of definition, ease of implementation and explicitness of semantics. Since we’re not trying to figure out what the Huffman code would be for writing down statistical models, it’s not clear to me that we as a community should be so concerned with the statistics of how OLS is used when the proposed burden of explicit intercepts is the addition of 4 characters worth of typing.

To put it another way, we’re designing a programming language, so we should put some effort into making sure that we create a language with simple and precise semantics.

After some reflection, I personally think that the use of implicit intercepts substantively worsens the formula language. In particular, I’d like to raise some concerns that I have about the use of implicit intercepts.

(1) It means that the + operator doesn’t have any coherent meaning. If intercept terms are explicit, the + operator is exactly equivalent to the statement, “add a new set of columns into my design matrix.” But if you have to write y ~ 0 + x, you’ve broken that interpretation of the + operator. Adding special cases to core operators is risky because any quirk in the core primitives of a language is likely to produce crazy results when you consider more complex expressions written in that language. For example, what does y ~ (0 + x) * f mean? Just how many special cases are you really adding to the language by allowing the 0 in formulas?

This problem of downstream poisoning of semantics is even worse if you consider allowing the use of -1 to reflect the absence of intercepts. If you allow people to write y ~ x - 1, you’ve added an entire operator to your language without adding any expressive power relative to the use of explicit intercepts.

But you haven’t just added in functionality that’s not strictly necessary. You’ve also introduced a lot of edge cases that make the formula language harder to work with. Here are some of the problems you now need to solve:

(1a) What’s the precedence of the subtraction operator? Is the precedence inherited from Julia appropriate?

(1b) How does the - operator interact with things like the * operator in a formula such as y ~ (x - 1) * f?

(1c) Can you use subtraction to remove terms other than intercepts? For example, could you write y ~ a*b - a&b to remove an interaction from a model with both main effects and interactions? If not, why not?

(1d) Where can -1 occur? Can you write y ~ -1? Can you write y ~ -1 + x? Can you write y ~ x - 1? Are the last two of these formulas identical or not?

These kinds of edge case problems need complete, formal solutions. And the solutions need to obey simple desiderata: for example, the formula language shouldn’t use + if the operator isn’t commutative.

In short, the use of implicit intercepts makes the formula language substantially more complex without making it substantially more expressive.

And that’s not my only concern about the implicit intercept approach. I’m also worried by the fact that:

(2) The use of implicit intercepts means that the nested model relationship is no longer reflected in surface syntax. The model y ~ 1 is a subset of the model y ~ x, but the formulas are not. This makes quick visual analysis of formulas harder.

(3) It makes the formula language less explicit, which means that newcomers need to memorize more rules and implicit assumptions to understand the language.

(4) It attempts to retain superficial similarities to R in a system that’s got very different semantics from R. False similarities to R are likely to make learning Julia harder because of proactive interference. A formula language that’s more obviously different from R will probably help people appreciate that Julia’s language is completely separate from R’s.

(5) It introduces synonyms into the formula language in the sense that y ~ x and y ~ 1 + x are meant to be identical. This means that implementations of the formula language need to be very careful to make sure that those synonyms never behave differently.

-- John

On Aug 24, 2014, at 9:19 AM, Bradley Setzler <[hidden email]> wrote:
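Points (1), (2) and (5) can be illustrated with a toy model in which every right-hand-side term is explicit. The Python sketch below is only an illustration under that assumption; `formula_terms` and `is_nested` are hypothetical names, and only `+`-separated terms are handled. With explicit intercepts, `+` is plain set union (so it is commutative), and model nesting is a subset test on what the user actually typed.

```python
def formula_terms(formula):
    """Return (response, frozenset of right-hand-side terms).

    Assumes explicit intercepts: the term '1' is present only if
    the user wrote it, and '+' simply adds a term to the set.
    """
    lhs, rhs = formula.split("~", 1)
    terms = frozenset(t.strip() for t in rhs.split("+") if t.strip())
    return lhs.strip(), terms


def is_nested(small, big):
    """True if every term of `small` also appears in `big`."""
    lhs_small, terms_small = formula_terms(small)
    lhs_big, terms_big = formula_terms(big)
    return lhs_small == lhs_big and terms_small <= terms_big


if __name__ == "__main__":
    # '+' is commutative: term order does not matter.
    print(formula_terms("y ~ 1 + x") == formula_terms("y ~ x + 1"))
    # Nesting is visible in the surface syntax.
    print(is_nested("y ~ 1", "y ~ 1 + x"))
    # Without a typed '1', 'y ~ x' has no intercept in this scheme.
    print(is_nested("y ~ 1", "y ~ x"))
```

In this scheme there are no synonyms to keep in agreement: two formulas describe the same model exactly when their term sets are equal.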
+1,000,000,000

A really good process for this kind of design is to try to document it. If the process of explaining something precisely and clearly is too difficult – e.g. you find yourself adding "except" and "but" everywhere – then you've got a problem. I suspect that implicit intercepts are a feature that leads to a lot of these kinds of digressions in explaining the formula language and that explicit intercepts would make it much easier to explain and only trivially more verbose to use.
In reply to this post by John Myles White
+10^(# R users)
The R formula language is a great interface, but the implicit intercept was a big mistake.

On Sunday, August 24, 2014 8:49:39 PM UTC+3, John Myles White wrote:
In reply to this post by John Myles White
There are a lot of good points in this email, but I don't think that having the explicit "1" is universally better. One could define addition and multiplication on formulas as
    (y ~ x) + (y ~ z) = y ~ x + z
    (y ~ x) * (y ~ z) = y ~ x + z + x&z

if the LHS variable is the same in both formulas. With an explicit constant we now get redundant terms to deal with.

AFAIK, the intercept/no-intercept distinction only makes sense in a regression context, which might be why people have focused on the models that are commonly fit. But if we're thinking about other uses of formulas, it would be strange to use plot(y ~ x, exampledata) to graph a scatter plot, but have to use `y ~ 1 + x` to estimate its slope. Especially because

    f = y ~ x
    plot(f, exampledata)
    lm(f, exampledata)
    plot(lm(f, exampledata))

looks like a very nice way to plot a relationship, estimate it, and plot the estimated relationship. If plot(y ~ x + 1, exampledata) so it's consistent with the glm use, what does plot(y ~ x, exampledata) generate?

Since formulas should be used in other places than estimation, but the distinction probably only makes sense in regression, another approach might be to make `nointercept` a separate argument for glm and deprecate using +0. Then `y ~` could be a valid but discouraged way of writing what's now `y ~ 1`.

--Gray

On Sunday, August 24, 2014 12:49:39 PM UTC-5, John Myles White wrote:
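The formula algebra proposed above can be sketched in a few lines of Python. This is a toy illustration, not an existing API: the class name `Formula` and its methods are hypothetical, terms are plain strings, and nothing is simplified, so the redundant terms that appear with explicit constants stay visible.

```python
from itertools import product


class Formula:
    """Toy formula: a response name plus an ordered list of terms."""

    def __init__(self, lhs, terms):
        self.lhs = lhs
        self.terms = list(terms)

    def _check_same_response(self, other):
        if self.lhs != other.lhs:
            raise ValueError("formulas must share the same LHS variable")

    def __add__(self, other):
        # (y ~ x) + (y ~ z) = y ~ x + z
        self._check_same_response(other)
        return Formula(self.lhs, self.terms + other.terms)

    def __mul__(self, other):
        # (y ~ x) * (y ~ z) = y ~ x + z + x&z
        self._check_same_response(other)
        interactions = [f"{a}&{b}" for a, b in product(self.terms, other.terms)]
        return Formula(self.lhs, self.terms + other.terms + interactions)

    def __str__(self):
        return f"{self.lhs} ~ " + " + ".join(self.terms)


if __name__ == "__main__":
    print(Formula("y", ["x"]) + Formula("y", ["z"]))
    print(Formula("y", ["x"]) * Formula("y", ["z"]))
    # With explicit constants the same rules produce redundant terms.
    print(Formula("y", ["1", "x"]) + Formula("y", ["1", "z"]))
    print(Formula("y", ["1", "x"]) * Formula("y", ["1", "z"]))
```

With implicit intercepts the first two results match the identities in the post. With explicit constants the sum contains `1` twice and the product contains terms such as `1&1` and `1&z`, so an implementation along these lines would need an extra simplification step.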
Sorry, this should be "If we require users to write... plot(y ~ x + 1, exampledata)..."
In reply to this post by Gray Calhoun
To be honest, I’m not at all concerned about that issue, since I am steadfastly opposed to the idea of using formulas for anything other than the specification of design matrices.
That tradition in R violates what is arguably the most central concept in Julia style: don’t write puns. We should never reuse operators to mean things that have no relationship with their core semantics.

Also, just to resolve one potential source of disagreement: everyone here understands that Julia is eagerly evaluated, right? I ask because the absence of delayed evaluation means that almost all R idioms involving the ~ tilde operator don’t make any sense in Julia and will never work in Julia.

-- John

On Aug 24, 2014, at 11:37 AM, Gray Calhoun <[hidden email]> wrote: