Help me understand kde_lscv() in KernelDensity.jl?


Help me understand kde_lscv() in KernelDensity.jl?

Daniel Carrera
Hello,

This week I've begun learning about non-parametric statistics and I'm interested in kernel density estimation, which is implemented in KernelDensity.jl.

Could someone help me understand how kde_lscv() differs from kde()? The documentation says it selects the bandwidth by "least squares cross validation". What does that mean, and what are the advantages? As far as I can figure out, LSCV tries to minimize an estimate of the integrated squared error, and that's better because the regular kde() function uses a bandwidth rule (Silverman's rule) that is derived for Gaussian data. Have I understood things correctly?

In general, should I worry about using kde() instead of kde_lscv() if I don't know ahead of time that my data is Gaussian? Or is kde() a good default?
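For reference, here is a minimal sketch of the two calls as I understand them from the KernelDensity.jl README (the field names `x` and `density` on the returned `UnivariateKDE` are taken from its docs):

```julia
# Minimal sketch comparing kde() and kde_lscv(), assuming
# KernelDensity.jl's documented API.
using KernelDensity, Random

Random.seed!(1)
data = randn(1000)              # a standard-normal sample

k_rot  = kde(data)              # bandwidth from Silverman's rule of thumb
k_lscv = kde_lscv(data)         # bandwidth from least-squares cross-validation

# Both return a UnivariateKDE with a grid `x` and the estimated `density`
# on that grid; the density should integrate to roughly 1 either way.
area = step(k_rot.x) * sum(k_rot.density)
println("approximate total mass: ", area)
```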

Cheers,
Daniel.


--
You received this message because you are subscribed to the Google Groups "julia-stats" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.

Re: Help me understand kde_lscv() in KernelDensity.jl?

Simon Byrne
Yes, you seem to have it more or less correct. Both are heuristics: Silverman's rule is a fairly simple one, justified by a derivation that assumes a Gaussian distribution; LSCV is more advanced, justified by a cross-validation argument.

As with all heuristics, there will be occasions where each of them breaks down, but generally I would lean toward kde_lscv(), although the two shouldn't give hugely different results.
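One place the two heuristics can visibly diverge is well-separated bimodal data, where the large overall variance inflates the rule-of-thumb bandwidth. A hedged sketch (the `pdf(::UnivariateKDE, x)` evaluation is assumed from the package docs; exact numbers depend on the sample):

```julia
# Sketch: comparing the two bandwidth heuristics on clearly non-Gaussian
# (bimodal) data, where Silverman's rule tends to oversmooth.
using KernelDensity, Random

Random.seed!(2)
bimodal = vcat(randn(500) .- 3, randn(500) .+ 3)   # two well-separated modes

k_rot  = kde(bimodal)        # Silverman rule-of-thumb bandwidth
k_lscv = kde_lscv(bimodal)   # cross-validated bandwidth

# The rule-of-thumb bandwidth scales with the sample's overall spread,
# which is large here, so kde() typically smooths more than kde_lscv().
# One way to compare: the fitted density in the trough between the modes.
trough_rot  = pdf(k_rot, 0.0)
trough_lscv = pdf(k_lscv, 0.0)
println("density at x=0: rule-of-thumb = ", trough_rot,
        ", LSCV = ", trough_lscv)
```

Plotting both estimates over the histogram makes the difference easiest to see.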

-Simon


On Wednesday, 20 January 2016 08:29:42 UTC, Daniel Carrera wrote: