fixed effects

fixed effects

Matthieu
Hello,

I'm just starting with Julia today, and I've coded a simple algorithm to demean columns of a DataFrame with respect to multiple high-dimensional fixed effects (where groups are potentially defined by multiple columns).

[FixedEffects.jl](https://github.com/matthieugomez/FixedEffects.jl)

The algorithm simply returns a new DataFrame with the partialled-out columns. One may then use these partialled-out variables in a simple OLS. This corresponds to a very basic version of the felm function from the R package lfe.
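For readers who have not seen this kind of demeaning, here is a minimal sketch of one common approach (alternating group demeaning until the vector stops changing). It is not the code from the package: the names `demean_by!` and `partial_out`, the tolerance, and the use of plain vectors with integer group ids in current Julia 1.x syntax are all assumptions made for illustration.

```julia
# Subtract the group means of `x` in place; `ids` holds integer group ids in 1:ngroups.
function demean_by!(x::Vector{Float64}, ids::Vector{Int}, ngroups::Int)
    sums = zeros(ngroups)
    counts = zeros(Int, ngroups)
    for i in eachindex(x)
        sums[ids[i]] += x[i]
        counts[ids[i]] += 1
    end
    for i in eachindex(x)
        x[i] -= sums[ids[i]] / counts[ids[i]]
    end
    return x
end

# Demean with respect to each fixed effect in turn, repeating until convergence.
# The input vector is left untouched; a new vector is returned.
function partial_out(x::Vector{Float64}, fes::Vector{Vector{Int}};
                     tol = 1e-10, maxiter = 10_000)
    res = copy(x)
    prev = similar(res)
    for _ in 1:maxiter
        copyto!(prev, res)                  # keep the previous iterate without allocating
        for ids in fes
            demean_by!(res, ids, maximum(ids))
        end
        maximum(abs, res .- prev) < tol && break
    end
    return res
end

firm = [1, 1, 2, 2, 3, 3]
year = [1, 2, 1, 2, 1, 2]
y = [1.0, 2.0, 3.0, 5.0, 4.0, 9.0]
r = partial_out(y, [firm, year])   # y with firm and year effects removed
```

After convergence, the returned vector sums to (numerically) zero within every firm and within every year, which is what makes it usable as a partialled-out variable in a plain OLS.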

I've tried to minimize copies (using SubDataFrames) and to return residuals aligned with the rows of the original DataFrame (in case some rows contain NA).

Since this is my first experience with Julia, I'd welcome any kind of feedback.

A couple of beginner questions:

- Is `copy` the best way to keep the previous result in [this iteration loop](https://github.com/matthieugomez/FixedEffects.jl/blob/master/src/fixedeffects.jl#L60)?
- What's the best way to add a subset argument to my function? I'd like this argument to let the user estimate the model and return the residuals on a subset of the DataFrame only.
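Neither question gets a direct answer later in the thread, so the following is only a sketch of one possible direction, written in current Julia 1.x syntax. For the first question, a preallocated buffer filled with `copyto!` avoids allocating a new array on every pass. For the second, a Boolean-mask keyword is one option; the function name `demean_subset`, the keyword name `subset`, and the choice of `missing` for excluded rows are assumptions, not the package's API.

```julia
# Question 1: reuse one buffer instead of calling `copy` on each iteration.
cur = [1.0, 2.0, 3.0]
prev = similar(cur)
copyto!(prev, cur)     # prev now holds the values of cur; no new array is allocated
cur .*= 0.5            # updating cur in place leaves prev untouched

# Question 2: hypothetical helper that demeans within groups using only the rows
# where `subset` is true. The result has one entry per original row, so it stays
# aligned with the input; rows outside the subset are `missing`.
function demean_subset(x::Vector{Float64}, ids::Vector{Int};
                       subset::AbstractVector{Bool} = trues(length(x)))
    out = Vector{Union{Missing, Float64}}(missing, length(x))
    rows = findall(subset)
    sums = Dict{Int, Float64}()
    counts = Dict{Int, Int}()
    for i in rows
        sums[ids[i]] = get(sums, ids[i], 0.0) + x[i]
        counts[ids[i]] = get(counts, ids[i], 0) + 1
    end
    for i in rows
        out[i] = x[i] - sums[ids[i]] / counts[ids[i]]
    end
    return out
end

res = demean_subset([1.0, 3.0, 10.0, 20.0], [1, 1, 2, 2];
                    subset = [true, true, true, false])
```

Here the fourth row is excluded, so the second group's mean is computed from the third row alone and the fourth entry of `res` is `missing`.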





--
You received this message because you are subscribed to the Google Groups "julia-stats" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.
Re: fixed effects

Patrick Kofod Mogensen
Just an FYI: because you export the demean function, it enters the user's namespace as soon as they write `using`. This means that you can just write demean(...); there is no need to write Package.Function(args...).
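As a small illustration of the point about `export`, here is a self-contained sketch. The module name `FixedEffectsDemo`, the toy `demean` body, and `helper` are made up for the example and are not taken from the package.

```julia
module FixedEffectsDemo   # hypothetical module name

export demean

# Toy stand-in for the real function: subtract the overall mean.
demean(x::AbstractVector{<:Real}) = x .- sum(x) / length(x)

# Not exported: reachable only through the module prefix.
helper() = "internal"

end # module

using .FixedEffectsDemo

a = demean([1.0, 2.0, 3.0])       # exported, so no prefix is needed
b = FixedEffectsDemo.helper()     # unexported names still need the prefix
```

The leading dot in `using .FixedEffectsDemo` is only needed because the module is defined in the same file; an installed package would be loaded with `using` and its plain name.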

I take it you are an econometrics student of some sort; feel free to hit me up if you have any future projects you need help with. Stuff like this is nice to have if we want Julia to enter exercise classes at universities.

Good to have you on board!
Patrick

Re: fixed effects

Matthieu
In reply to this post by Matthieu
Thanks. 

The current version of the package now estimates models with instrumental variables (2SLS), high-dimensional fixed effects, and White / clustered standard errors. This makes it possible to estimate a large share of the models used in applied economics research. Moreover, the function seems faster than the corresponding Stata and R functions (areg and lfe, respectively), in particular for models with one high-dimensional fixed effect.

Two more points make this function differ from the lm function in GLM:

1. The regression result object is very light (basically the initial formula, a vector of coefficients, and a covariance matrix). In contrast, since the output of GLM contains the original DataFrame, the converted matrix of regressors, the model response, etc., it can actually take much more space than the initial DataFrame.
I have chosen to return a light object because it makes it possible to estimate multiple models without requiring more RAM at every step. Methods such as predict and residual can still be defined, as long as the user provides a DataFrame.

2. The function has an argument that changes the way standard errors are computed. In R, corrected standard errors are generally estimated in a second step, through a different package such as vcov or multiwayvcov. This strikes me as inefficient and counterintuitive.

I've defined an abstract type AbstractVcov. Any user can define a new type (a subtype of this abstract type), as long as they define a method, vcov, that acts on a regressor matrix (X), a hat matrix (X'X in the simple case), and a vector of residuals. This seems enough to define a wide range of standard errors.

I've only defined three types (simple, White, clustered).
For instance, to estimate a model with White robust standard errors:
reg(formula, df, VceWhite())

To estimate a model with clustered standard errors:
reg(formula, df, VceCluster(:clustervar))
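To make the two design points concrete, here is a rough sketch of how a light result object and dispatch on a covariance type could fit together. It is not the package's code. Only the name `AbstractVcov` comes from the post; `VcovSimpleDemo`, `VcovWhiteDemo`, `LightFit`, `vcov_demo`, `reg_demo`, and `predict_demo` are hypothetical, the sketch works on plain matrices rather than a formula and DataFrame, and it uses current Julia 1.x syntax (`abstract type ... end`).

```julia
using LinearAlgebra

abstract type AbstractVcov end                 # name from the post
struct VcovSimpleDemo <: AbstractVcov end      # hypothetical concrete types
struct VcovWhiteDemo <: AbstractVcov end

# Light result: coefficients and covariance only; no data, regressors, or response.
struct LightFit
    coef::Vector{Float64}
    vcov::Matrix{Float64}
end

# Homoskedastic case: sigma^2 * (X'X)^-1
function vcov_demo(::VcovSimpleDemo, X, XtX, resid)
    n, k = size(X)
    return (sum(abs2, resid) / (n - k)) * inv(XtX)
end

# White (HC0) case: (X'X)^-1 * X' diag(e^2) X * (X'X)^-1
function vcov_demo(::VcovWhiteDemo, X, XtX, resid)
    A = inv(XtX)
    return A * (X' * (X .* resid .^ 2)) * A
end

# The third argument selects the covariance estimator through dispatch.
function reg_demo(X::Matrix{Float64}, y::Vector{Float64},
                  v::AbstractVcov = VcovSimpleDemo())
    beta = X \ y
    resid = y - X * beta
    return LightFit(beta, vcov_demo(v, X, X' * X, resid))
end

# predict needs the data again, because the result does not store it.
predict_demo(m::LightFit, X::Matrix{Float64}) = X * m.coef

X = [ones(5) [1.0, 2.0, 3.0, 4.0, 5.0]]
y = [1.1, 1.9, 3.2, 3.8, 5.3]
m1 = reg_demo(X, y)                     # simple standard errors
m2 = reg_demo(X, y, VcovWhiteDemo())    # White standard errors, same coefficients
```

A user-defined estimator would then amount to one new subtype plus one new `vcov_demo` method, without touching `reg_demo`.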




Re: fixed effects

Milan Bouchet-Valat
On Wednesday, June 24, 2015 at 09:25 -0700, Matthieu wrote:
> Thanks.
>
> The current version of the package now estimates models with
> instrumental variables (2SLS), high dimensional fixed effects, and
> white / clustered standard errors. This allows to estimate a large
> part of models used in applied economics research. Moreover, this
> function seems faster than Stata and R corresponding functions
> (respectively areg / lfe), in particular for models with one high
> dimensional fixed effect.
I'm not very familiar with these models, but that looks really nice.
Have you considered using the fit() function with a model type to be
more similar to GLM.jl?
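For context, the pattern being suggested is the one where the model type is passed as the first argument and selects the method through dispatch, as in GLM.jl's `fit(LinearModel, ...)`. A minimal sketch of that shape follows; `FixedEffectModelDemo` is a hypothetical type, and the local `fit` is a stand-in defined here so the example runs without StatsBase or GLM.

```julia
# Hypothetical model type; it both selects the method and holds the result.
struct FixedEffectModelDemo
    coef::Vector{Float64}
end

# Local stand-in for a generic `fit` function, dispatching on the model type.
fit(::Type{FixedEffectModelDemo}, X::Matrix{Float64}, y::Vector{Float64}) =
    FixedEffectModelDemo(X \ y)

m = fit(FixedEffectModelDemo, [ones(3) [1.0, 2.0, 3.0]], [2.0, 4.0, 6.0])
```

In a real package the method would extend the shared generic function rather than define a new one, so that several model types can coexist under the same `fit` name.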

> Two more points make this function differ from the lm function in
> GLM:
>
> 1. The regression result object is very light (basically the initial
> formula, a vector of coefficients, and a covariance matrix). In
> contrast, since the output of GLM contains the original dataframe,
> the converted matrix of regressors, the model response etc,   the
> output from GLM can actually take much more space than the initial
> DataFrame.
> I have chosen to return a light object because it allows to estimate
> multiple models without requiring more RAM at every step. Methods
> such as predict and residual can be defined as long as the user
> provides a DataFrame
I agree that's likely a good idea. With data sources like databases, it
wouldn't make any sense to try saving all of the data with the model.
We could imagine adding an argument to keep a copy of the data, if it
turns out that's needed.

I think the only case where having the data in the model object is useful
is when calling predict(). Maybe it would be possible to save just the name
of the data frame, and use it if it's in scope?

> 2. The function has an argument that allows to change the way errors
> are computed. In R, correct errors are generally estimated in a
> second step, through a different package like vcov, multiwayvcov.
> This strikes me as inefficient and counterintuitive.
>
> I've defined an abstract type AbstractVcov. Any user can define a new
> type (child of this abstract type), as long as he/she defines a
> method, vcov, that acts on a regressor matrix (X), a hat matrix (X'X
> in the simple case), and a vector of residuals. This seems enough to
> define a wide range of standard errors.
>
> I've only defined 3 types (simple, white, clustered).
> For instance, to estimate a model with white robust standard errors
> reg(formula, df, VceWhite())
>
> To estimate a model with clustered standard errors
> reg(formula, df, VceCluster(:clustervar))
Sounds cool. I had opened an issue in GLM.jl about this:
https://github.com/JuliaStats/GLM.jl/issues/42

Do you have any ideas about how to handle bootstrap in the same
framework?


Regards

Re: fixed effects

Matthieu
Thanks! I'm glad you also think standard errors should be an argument of the fit function!
I have considered using the fit function, but I don't really understand what the first argument is supposed to be: the syntax is very different between, say, GLM, MixedModels, and NLreg (https://github.com/JuliaStats/StatsBase.jl/issues/116).


Re: fixed effects

Patrick Kofod Mogensen
In reply to this post by Matthieu
I'll have a look at the updates later. Would you be against having the standard errors as a keyword instead, with some default (sandwich, or whatever)? The "standard" way (or at least how a lot of people seem to be doing it) is to have

reg(formula, df; se = :sandwich)

so you would run

reg(y ~ x + z, df)

for the default, and

reg(y ~ x + z, df; se = :my_custom_se)

for some other standard error method. You would have to do the clustering a bit differently, but I think you get the idea. You can look at Optim.jl or QuantileRegression.jl to see what I mean (they have "method" keywords).
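A sketch of what the keyword variant could look like is below. The function name `reg_kw`, the accepted symbols `:simple` and `:white`, and the matrix-based signature are assumptions for illustration; the thread itself only proposes the general `se = ...` shape.

```julia
using LinearAlgebra

# Hypothetical keyword-based interface: a symbol selects the standard error method.
function reg_kw(X::Matrix{Float64}, y::Vector{Float64}; se::Symbol = :simple)
    beta = X \ y
    resid = y - X * beta
    A = inv(X' * X)
    if se == :simple
        V = (sum(abs2, resid) / (size(X, 1) - size(X, 2))) * A
    elseif se == :white
        V = A * (X' * (X .* resid .^ 2)) * A
    else
        throw(ArgumentError("unknown se method: $se"))
    end
    return beta, sqrt.(diag(V))    # coefficients and their standard errors
end

X = [ones(5) [1.0, 2.0, 3.0, 4.0, 5.0]]
y = [1.1, 1.9, 3.2, 3.8, 5.3]
b_default, se_default = reg_kw(X, y)
b_white, se_white = reg_kw(X, y; se = :white)
```

One trade-off with this shape is that the set of methods is closed inside the function body, and a bare symbol cannot carry extra information such as the name of a cluster variable, which is why clustering would need to be handled differently.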

Re: fixed effects

Patrick Kofod Mogensen
Ah, now I see what you did (I looked in the repo): VceWhite() is a constructor. I thought it was the vcov function for White standard errors :)
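The distinction can be shown with a small sketch: the constructor call only builds a value that records which estimator (and which column) to use, and the actual computation happens in a method that dispatches on that value. Everything here is hypothetical (`AbstractVcovDemo`, `ClusterDemo`, `vcov_cluster_demo`, a NamedTuple standing in for the DataFrame), and the formula is the plain cluster-robust sandwich without any small-sample correction.

```julia
using LinearAlgebra

abstract type AbstractVcovDemo end

struct ClusterDemo <: AbstractVcovDemo
    clustervar::Symbol        # which column holds the cluster ids
end

v = ClusterDemo(:firm)        # constructor call: builds a value, computes nothing

# The computation lives in a method that receives the constructed value.
function vcov_cluster_demo(v::ClusterDemo, data, X, resid)
    ids = getproperty(data, v.clustervar)
    k = size(X, 2)
    meat = zeros(k, k)
    for g in unique(ids)
        rows = findall(==(g), ids)
        s = X[rows, :]' * resid[rows]     # scores summed within the cluster
        meat += s * s'
    end
    A = inv(X' * X)
    return A * meat * A
end

data = (firm = [1, 1, 2, 2, 3, 3],)       # NamedTuple standing in for a DataFrame
X = [ones(6) [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]
y = [1.2, 1.8, 3.1, 4.3, 4.9, 6.4]
beta = X \ y
V = vcov_cluster_demo(v, data, X, y - X * beta)
```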
