How to run a log-log linear regression in Julia?
-- Like "lm(log (y) ~ log (x)" in R You received this message because you are subscribed to the Google Groups "julia-stats" group. To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email]. For more options, visit https://groups.google.com/d/optout. |
On Sunday, 21 February 2016 at 15:54 -0800, Mario Henrique wrote:
> How to run a log-log linear regression in Julia?
> Like "lm(log(y) ~ log(x))" in R

AFAIK this is perfectly equivalent to taking the log of both x and y, and applying a linear regression on the resulting variables. So that should be quite straightforward with GLM.jl (see its documentation).

Regards
The documentation is not very explicit about the preferred way to do something like that. It looks to me as if you have to put the variables in a DataFrame, update the DataFrame with log versions, then do the lm.
Is that the preferred method?

On Monday, 22 February 2016 at 10:45:25 UTC+1, Milan Bouchet-Valat wrote:
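For reference, that workflow can be sketched like this (a sketch updated to current DataFrames.jl / GLM.jl syntax; the 2016-era code elsewhere in this thread uses `test[:logx]` indexing and unvectorized `log` instead):

```julia
# Sketch: put the variables in a DataFrame, add log-transformed columns,
# then fit with lm. Syntax here is current DataFrames.jl / GLM.jl; the
# original thread used the Julia 0.4-era equivalents.
using DataFrames, GLM

x = collect(1.0:5.0) .+ rand(5)
y = collect(1.0:5.0) .+ rand(5)
test = DataFrame(x = x, y = y)
test.logx = log.(test.x)
test.logy = log.(test.y)
model = lm(@formula(logy ~ logx), test)
coef(model)   # [intercept, slope]; the slope is the elasticity estimate
```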
Yes, I have to put the log variables in the data frame first; if I try "lm(log(y) ~ log(x), data)" I get an error. Thanks for the help.

2016-02-22 8:30 GMT-03:00 Michael Borregaard <[hidden email]>:
--
Mario Henrique
On Monday, 22 February 2016 at 08:48 -0300, Mario Silveira wrote:
> Yes, I have to put the log variables in the data frame first; if I try "lm(log(y) ~ log(x), data)" I get an error.

Yes, sorry I wasn't clear (I thought your question was about statistical theory). Adding transformed variables to the data frame is the preferred method AFAIK, and it's not too annoying to do either. That said, transformations inside formulas will likely be supported at some point. See this old issue: https://github.com/JuliaStats/DataFrames.jl/issues/19

Regards

> Thanks for help.
>
> 2016-02-22 8:30 GMT-03:00 Michael Borregaard <[hidden email]>:
> > The documentation is not very explicit about the preferred way to
> > do something like that. It looks to me as if you have to put the
> > variables in a DataFrame, update the DataFrame with log versions,
> > then do the lm.
> >
> > using GLM, DataFrames
> > x = collect(1:5) + rand(5)
> > y = collect(1:5) + rand(5)
> > test = DataFrame(x = x, y = y)
> > test[:logx] = log(test[:x])
> > test[:logy] = log(test[:y])
> > lm(logy ~ logx, test)
> >
> > Is that the preferred method?
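As a postscript: in-formula transformations did arrive later. A sketch, assuming a GLM.jl / StatsModels version recent enough that `@formula` accepts function calls (the data below are made up, close to an exact y = x^2 power law):

```julia
# Sketch: log transforms inside the formula, assuming a GLM.jl new enough
# to accept function calls in @formula (not available when this thread was
# written). The toy data are approximately y = x^2.
using DataFrames, GLM

df = DataFrame(x = [1.5, 2.2, 3.1, 4.7, 5.3],
               y = [2.0, 4.9, 9.8, 21.0, 28.1])
model = lm(@formula(log(y) ~ log(x)), df)
coef(model)   # the slope comes out close to 2
```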
Thanks for the clarification! And great news on the potential for functions to be included in formulas. It is not too annoying to add the variables, but I feel it will help create a more fluid and natural analytical workflow.

On Mon, Feb 22, 2016 at 1:53 PM, Milan Bouchet-Valat <[hidden email]> wrote:
Yes, Michael. I'm an economist, and in most econometrics software I know you have to add the log variables first to run the regression, so doing that is no problem for me. Thanks for the help, Julia is great!

2016-02-22 10:07 GMT-03:00 Michael Krabbe Borregaard <[hidden email]>:
--
Mario Henrique
This page seems relevant:
HTH,
Cédric

On Monday, February 22, 2016 at 8:59:29 AM UTC-5, Mario Henrique wrote:
Thanks for the attention, Cédric. In economics, log-log models are widely used for various reasons, the main one being that they let you interpret the model in terms of elasticities. Log-log with OLS appears in the majority of econometrics books and has been shown, empirically and mathematically, to be useful when used with the right premises. The author may be right that log-log has its problems, but I think it is wrong to generalize; at the least he should present some citations with proof and evidence, otherwise it is merely an allegation without scientific value.

2016-03-01 22:39 GMT-03:00 Cedric St-Jean <[hidden email]>:
--
Mario Henrique
Cedric's point isn't that you can't have models that appear linear on a log-log plot: power laws are precisely this kind of model. The point is that you should not use linear regression on the log-transformed data to estimate the model parameters. If that's what's done in econometrics books, they may need revision.

On Tue, Mar 1, 2016 at 9:17 PM, Mario Silveira <[hidden email]> wrote:
Econometrics eventually "developing independently" of the statistical form. The reason is the type of data we deal. In a basic course in statistics, linear regression takes only a few chapters of the book. In a first course of econometrics, the whole course is about regression models, when the estimators are biased, multicollinearity, heteroscedasticity, static or dynamic interpretation ... If you interest in see a litle of how econometric works, an sugestion is Econometrics Analysis by Greene.I believe that no other area of study is so much concerned with the accuracy of regression as in econometrics. Ps: in econometrics, statistical estimation happens in a second stage, first the model must be validated theoretically. Of course, there are several problems, but the log-log regression to elasticity is one of the most basic and reliable tools used every day worldwide. 2016-03-01 23:26 GMT-03:00 Stefan Karpinski <[hidden email]>:
--
Mario Henrique
Maybe I did not express myself properly. Here is an example of what I meant: http://www.dummies.com/how-to/content/econometrics-and-the-loglog-model.html

2016-03-01 23:49 GMT-03:00 Mario Silveira <[hidden email]>:
--
Mario Henrique
I think the key claim here is:

> You can estimate this model with OLS by simply using natural log values for the variables instead of their original scale.

There are people far more qualified than I am on this list to confirm (or deny) this, but statistical best practice seems to be that this is not a good way to estimate the model parameters. The attached paper gives a nice overview; I found it very informative in any case. They suggest using equation (5) to estimate the exponent of a power law. The preceding paragraph explains why using OLS on the log-transformed values is a bad idea, giving an example with synthetic data where that method produces a confidence interval that fails to contain the true parameter value. Using equation (5), on the other hand, gives a confidence interval perfectly centered on the true value.

On Tue, Mar 1, 2016 at 9:58 PM, Mario Silveira <[hidden email]> wrote:
Attachment: 0412004v3.pdf (2M)
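If I read it right, the estimator Stefan points to (equation (5) of the attached Newman review; please check against the paper itself) is the maximum-likelihood estimate α̂ = 1 + n / Σ ln(x_i / x_min), with standard error (α̂ - 1) / √n. A sketch in Julia, with a made-up helper name:

```julia
# Sketch of the maximum-likelihood power-law exponent estimate
# alpha_hat = 1 + n / sum(log(x_i / xmin)), standard error
# (alpha_hat - 1) / sqrt(n). powerlaw_mle is a made-up helper name.
function powerlaw_mle(x, xmin)
    tail = filter(v -> v >= xmin, x)   # only the tail above xmin is modeled
    n = length(tail)
    alpha = 1 + n / sum(log.(tail ./ xmin))
    return alpha, (alpha - 1) / sqrt(n)
end

# Check on synthetic Pareto data with true exponent 2.5, drawn by
# inverting the CCDF (x/xmin)^-(alpha - 1).
xmin = 1.0
x = xmin .* rand(10_000) .^ (-1 / (2.5 - 1))
alpha, se = powerlaw_mle(x, xmin)
```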
Thanks Stefan for the article. I'll study it calmly and see how I can apply it. It is always good to have feedback from people in other fields.

2016-03-02 1:20 GMT-03:00 Stefan Karpinski <[hidden email]>:
--
Mario Henrique
In reply to this post by Mario Silveira
There is some confusion here. The context for the blog post is not really explained. The kind of log-log plots that the blog is talking about are very different from the log-log regressions made in econometrics.

In the blog post, the author considers plots of P(X>x) against x with log scales on both axes, i.e. a single variable. Power laws would give a straight downward-sloping line. By construction, the dots will cross the y-axis at one regardless of the distribution that generated the data, but a regression line fitted to the data might not do that, which is why the author of the blog post writes that "It [the least squares fit of the log transformed data] generally doesn't even give you a probability distribution".

The model often used in econometrics, and probably many other places, relates **two** or more log-transformed variables. As Mario points out, the regression coefficients are then interpreted as elasticities. There are probably also bad things to say about this, but that would be something different from what the blog post is criticizing. Economists are usually not that interested in power laws. The main exception is that the Pareto distribution is often used for modelling the upper tail of the income distribution.

By the way, the blog seems quite good if you are interested in statistics. I didn't know about it, so thanks for the link.

On Tue, Mar 1, 2016 at 9:58 PM, Mario Silveira <[hidden email]> wrote:
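Andreas' point about the blog post can be illustrated with a quick sketch (my own, not from the thread): fitting a straight line to log P(X>x) versus log x "works" even when the data are not power-law distributed at all, exponential here, so getting a line out of the fit proves nothing about the distribution.

```julia
# Sketch: OLS happily fits a line to log(CCDF) vs log(x) even for
# exponential data, which is not a power law.
x = sort(-log.(rand(1000)))            # Exponential(1) samples via inverse CDF
ccdf = (length(x):-1:1) ./ length(x)   # empirical P(X >= x_i), never zero
lx, ly = log.(x), log.(ccdf)
X = [ones(length(lx)) lx]              # design matrix for OLS
intercept, slope = X \ ly              # the fit goes through without complaint
```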
You got the point, Andreas. Thank you! This group is wonderful.

2016-03-02 1:35 GMT-03:00 Andreas Noack <[hidden email]>:
--
Mario Henrique
Thanks Andreas, you're right, I confused "power law" and "power law distribution".
On Tuesday, March 1, 2016 at 11:41:09 PM UTC-5, Mario Henrique wrote:
Glad we resolved that. Andreas' explanation makes sense.

On Wed, Mar 2, 2016 at 10:53 AM, Cedric St-Jean <[hidden email]> wrote:
In reply to this post by Andreas Noack
On Tuesday, March 1, 2016 at 11:35:22 PM UTC-5, Andreas Noack wrote:
...
I would like to take a crack at this. I guess it's a bit of a hobby horse. I don't know enough economics to criticize the economics scenario, but I have seen a lot of questionable log-transformed least squares in the hard sciences.

The typical scenario that justifies standard least squares is a model that looks like this:

(1) y_i = f(x_i; a) + σ ϵ_i

where f is some deterministic model function, x_i and y_i are the independent and dependent data variables, a is a free parameter or a whole collection of free parameters, σ is the standard deviation of the error (which is typically unknown or uncertain), and the ϵ_i are independent samples from a standard normal distribution. The known (standard normal) joint distribution of the ϵ_i justifies taking the likelihood in terms of y_i - f(x_i; a) to be multivariate normal, which in turn justifies least squares as maximum likelihood. This story is told in various notations at the beginning of essentially every treatment of maximum likelihood.

If your error is multiplicative and log-normally distributed instead of additive and normally distributed, then the model instead looks like

(2) y_i = f(x_i; a) exp(σ ϵ_i)

where all the symbols have exactly the same meaning as in (1) (note that exponentiating a normally distributed variable gets you a log-normally distributed variable). Then taking logs gets you back to something that looks like (1) in terms of transformed variables:

log(y_i) = log(f(x_i; a)) + σ ϵ_i

which justifies log-transformed least squares as maximum likelihood in the same way as before.

This is fine in theory, but there are a couple of problems in practice:

1. People very frequently decide to do log-transformed least squares based on the algebraic form of f: if f is exponential or a power law, the log transformation turns a non-linear least-squares problem into linear least squares. Linear least squares is easier to execute, so that's what people frequently do. But the algebraic form of f is a totally independent issue from the question of whether the errors are additive or multiplicative (or enter in some even more complicated way). Therefore, the algebraic form of f is totally independent from the statistical justification for log-transformed least squares, contrary to folklore and popular practice.

2. Additive noise of some kind almost always exists in real measurements, even if there is *also* multiplicative noise. If you put something through an electronic circuit, you'll end up with at least some additive Johnson noise. Additionally, there is very commonly an uncertain additive background of some kind. So even if multiplicative noise is the dominant effect, more realistic models look like

(3) y_i = f(x_i; a) exp(σ ϵ_i) + b + ω δ_i

where b is an uncertain additive background, ω is the standard deviation of the additive noise, and δ_i represents independent samples from a standard normal distribution, just like ϵ_i.

If you ignore the additive noise, try to account for the additive background by subtracting off an uncertain estimate of it, and then perform log-transformed least squares, you end up with big problems if you have any data where y_i is small compared to the uncertainty in b (the background), or compared to ω (the size of the additive noise). Poorly accounted-for additive effects might make some of your data negative (maybe only after background subtraction), which makes the log-transformed procedure totally blow up. You sometimes see people try to fix this up by clamping the data to be above some very small positive value. Even when nothing ends up negative, very small y_i often end up with very large relative error. Because log-transformed least squares essentially assumes constant relative error, your whole fit may end up being dominated by very small data values and their anomalously large relative error.

Doing standard least squares when the error is actually multiplicative is often less bad than doing log-transformed least squares when the error is actually additive, because it's often preferable to have your fit dominated by large values and their (possibly) anomalously large absolute error than to have it dominated by small values and their (possibly) anomalously large relative error. You usually want to err on the side of accurately modeling the part of your data that is not very close to zero. But it's also possible to turn the maximum-likelihood crank on the full model (3), and with the help of software, I don't think this actually has to be so much more onerous than any other regression procedure.

I'm not sure this sketch is enough to convince anyone who doesn't already know about all of this, and I also think that various aspects of it don't apply directly to the economics scenario. But I wanted to mention it since log-transformed least squares does run into big problems even in the variable-response scenario, albeit somewhat different problems from the ones covered by Shalizi and Newman for the distribution-fitting scenario.
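The second problem above shows up in a small simulation. This is my own sketch, not code from the thread: data generated from model (3) with a power-law f, where the additive background b flattens the log-log slope at small x and pulls the log-transformed OLS estimate well below the true exponent, while the large-x subset (where the background is negligible) still recovers it.

```julia
# Sketch of model (3): y = a*x^p * exp(sigma*eps) + b + omega*delta.
# omega is kept small here so no y goes negative and no clamping is needed.
using Random
Random.seed!(1)

n = 200
x = 10 .^ range(-2, 2; length = n)    # log-spaced, spans small and large x
a, p = 1.0, 2.0                       # true power law, y ≈ x^2
sigma, b, omega = 0.1, 0.05, 0.01
y = a .* x .^ p .* exp.(sigma .* randn(n)) .+ b .+ omega .* randn(n)

X = [ones(n) log.(x)]
p_all = (X \ log.(y))[2]              # log-transformed OLS on all data:
                                      # the background flattens the slope

big = x .> 1.0                        # subset where the background is negligible
p_big = (X[big, :] \ log.(y[big]))[2] # this recovers the true exponent
```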
Thanks for writing that up. I really enjoyed reading it. It would actually make a rather nice blog post.

On Fri, Mar 4, 2016 at 11:23 AM, Jason Merrill <[hidden email]> wrote: