trimmean() is biased / aka removes values unevenly

Daniel Carrera
Hello,

I was looking through the source code of trimmean() and I just realized that in general it does not remove data evenly from the top and bottom. Here is the source:


"""
    trimmean(x, p)

Compute the trimmed mean of `x`, i.e. the mean after removing a
proportion `p` of its highest- and lowest-valued elements.
"""
function trimmean(x::RealArray, p::Real)
    n = length(x)
    n > 0 || error("x can not be empty.")
    0 <= p < 1 || error("p must be non-negative and less than 1.")
    rn = min(round(Int, n * p), n-1)

    sx = sort(x)
    nl = rn >> 1
    nh = (rn - nl)
    s = 0.0
    for i = (1+nl) : (n-nh)
        @inbounds s += sx[i]
    end
    return s / (n - rn)
end


So this removes `nl` elements from the bottom and `nh` elements from the top. When `rn` is even the two counts are equal, but when `rn` is odd `nh` is one larger, so trimmean() removes one more value from the top than from the bottom. This is not how I have seen the trimmed mean defined. Every source that I know says that the trimmed mean removes the same number of elements from the top and bottom. For example, Wilcox (2010) says: "More generally, if we round [p * n] down to the nearest integer g, remove the g smallest and largest values and average the n - 2g values that remain".

This distinction matters. There are theorems about how to compute the variance and confidence intervals for the trimmed mean that rely on one particular definition of the trimmed mean. If the definition changes, I can no longer compute a confidence interval for the computed value.
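To make the counting concrete, here is a minimal sketch that reproduces only the arithmetic for `rn`, `nl` and `nh` from the source quoted above. The helper name `trim_counts` is made up for illustration and is not part of StatsBase.

```julia
# Hypothetical helper (not part of StatsBase): reproduces only the counting
# logic of the trimmean() source quoted above.
function trim_counts(n::Integer, p::Real)
    rn = min(round(Int, n * p), n - 1)  # total number of elements to drop
    nl = rn >> 1                        # dropped from the bottom: floor(rn / 2)
    nh = rn - nl                        # dropped from the top: ceil(rn / 2)
    return (nl, nh)
end

# Print the counts for a sample of size 10 and a few values of p.
for p in (0.1, 0.2, 0.3, 0.5)
    nl, nh = trim_counts(10, p)
    println("n = 10, p = $p: drop $nl from the bottom, $nh from the top")
end
```

For n = 10, `trim_counts(10, 0.2)` gives `(1, 1)`, while `trim_counts(10, 0.1)` gives `(0, 1)` and `trim_counts(10, 0.5)` gives `(2, 3)`: whenever `rn` is odd, the extra element is taken from the top.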

Another difference between the trimmean() function and the usual definition is that the "p% trimmed mean" should mean that you remove p% from the top and p% from the bottom, whereas in the trimmean() function it means that you remove (p/2)% from the top and (p/2)% from the bottom.
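The two definitions can be written side by side. The notation here is my own shorthand, not taken from Wilcox or from the source: $x_{(1)} \le \dots \le x_{(n)}$ are the sorted values, $g$ is the per-side count in the quoted definition, and $r$, $n_l$, $n_h$ correspond to `rn`, `nl`, `nh` in the code.

```latex
% Definition quoted from Wilcox (2010): the same count g is removed from each end.
g = \lfloor p\,n \rfloor, \qquad
\bar{x}_{t} = \frac{1}{n - 2g} \sum_{i = g + 1}^{n - g} x_{(i)}

% What the quoted trimmean() source computes: r values removed in total.
r = \min\bigl(\operatorname{round}(p\,n),\; n - 1\bigr), \qquad
n_l = \lfloor r / 2 \rfloor, \qquad
n_h = r - n_l, \qquad
\bar{x}_{\mathrm{trimmean}} = \frac{1}{n - r} \sum_{i = 1 + n_l}^{n - n_h} x_{(i)}
```

The two agree only when $r$ is even and $r = 2g$, that is, when the `p` passed to trimmean() is twice the per-side fraction and the rounding happens to work out.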


Is there any chance that the definition of trimmean() could be changed in a future release to agree with Wilcox (2010) and other texts?


Cheers,
Daniel.

--
You received this message because you are subscribed to the Google Groups "julia-stats" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
For more options, visit https://groups.google.com/d/optout.

Re: trimmean() is biased / aka removes values unevenly

Milan Bouchet-Valat
On Wednesday 27 July 2016 at 13:18 -0700, Daniel Carrera wrote:

> Hello,
>
> I was looking through the source code of trimmean() and I just
> realized that in general it does not remove data evenly from the top
> and bottom. Here is the source:
>
>
> """
>     trimmean(x, p)
>
> Compute the trimmed mean of `x`, i.e. the mean after removing a
> proportion `p` of its highest- and lowest-valued elements.
> """
> function trimmean(x::RealArray, p::Real)
>     n = length(x)
>     n > 0 || error("x can not be empty.")
>     0 <= p < 1 || error("p must be non-negative and less than 1.")
>     rn = min(round(Int, n * p), n-1)
>
>     sx = sort(x)
>     nl = rn >> 1
>     nh = (rn - nl)
>     s = 0.0
>     for i = (1+nl) : (n-nh)
>         @inbounds s += sx[i]
>     end
>     return s / (n - rn)
> end
>
>
> So this removes `nl` elements from the bottom and `nh` elements from
> the top. Some times these are the same number, and some times `nh` is
> one higher. This means that some times trimmean() removes values
> unevenly. This is not how I have seen the trimmed mean defined. Every
> source that I know says that the trimmed mean removes the same number
> of elements from the top and bottom. For example, Wilcox (2010) says:
> "More generally, if we round [p * n] down to the nearest integer g,
> remove the g smallest and largest values and average the n - 2g
> values that remain". This distinction is not irrelevant. There are
> theorems about how to compute the variance and confidence intervals
> for the trimmed mean that rely on one particular definition of the
> trimmed mean. If you change the definition, I can no longer compute a
> confidence interval for the computed value.
>
> Another difference between the trimmean() function and the usual
> definition is that the "p% trimmed mean" should mean that you remove
> p% from the top and p% from the bottom. Whereas in the trimmean()
> function it means that you remove (p/2)% from the top and (p/2)% from
> the bottom.
>
>
> Is there any chance that the definition of trimmean() could be
> changed in a future release to agree with Wilcox (2010) and other
> texts?
I guess so, in particular if you confirm that other major software
behaves that way, and even more so if you make a PR.


Regards


Re: trimmean() is biased / aka removes values unevenly

Daniel Carrera

On 27 July 2016 at 22:42, Milan Bouchet-Valat <[hidden email]> wrote:
> > Is there any chance that the definition of trimmean() could be
> > changed in a future release to agree with Wilcox (2010) and other
> > texts?
> I guess so, in particular if you confirm that other major software
> behaves that way, and even more so if you make a PR.
>
> Regards


Thanks. I'll go find out what other software does.

Cheers,
Daniel.


Re: trimmean() is biased / aka removes values unevenly

Daniel Carrera
In reply to this post by Milan Bouchet-Valat
On 27 July 2016 at 22:42, Milan Bouchet-Valat <[hidden email]> wrote:
> > Is there any chance that the definition of trimmean() could be
> > changed in a future release to agree with Wilcox (2010) and other
> > texts?
> I guess so, in particular if you confirm that other major software
> behaves that way, and even more so if you make a PR.


OK, I've done that now. I can confirm that R computes the trimmed mean the way I described:

> help(mean)
...
    trim: the fraction (0 to 0.5) of observations to be trimmed from
          each end of ‘x’ before the mean is computed.  Values of trim
          outside that range are taken as the nearest endpoint.
...
> x = c(1, 10,20,30,40,50,60,70,80, 10000)
> mean(x, trim=0.25)
[1] 45
> mean(x, trim=0.05)
[1] 1036.1


Compared with:

julia> using StatsBase
julia> x = [1, 10,20,30,40,50,60,70,80, 10000]
julia> trimmean(x,0.5)
40.0
julia> trimmean(x,0.1)
40.111111111111114
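For reference, the symmetric definition quoted from Wilcox (2010) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the StatsBase implementation and not the code of the PR: the name `symmetric_trimmean` is made up, `p` is taken to be the fraction removed from each end, and `p >= 0.5` is simply rejected rather than handled the way R does.

```julia
# Hypothetical sketch (not the StatsBase implementation and not the PR code):
# symmetric trimmed mean that removes g = floor(p * n) values from EACH end.
function symmetric_trimmean(x::AbstractVector{<:Real}, p::Real)
    n = length(x)
    n > 0 || throw(ArgumentError("x must not be empty"))
    0 <= p < 0.5 || throw(ArgumentError("p must satisfy 0 <= p < 0.5"))
    # Per-side count; the min() guards against floating-point edge cases so
    # that at least one value always remains.
    g = min(floor(Int, p * n), (n - 1) ÷ 2)
    sx = sort(x)
    return sum(sx[(g + 1):(n - g)]) / (n - 2g)
end

# Same data as in the comparison above.
x = [1, 10, 20, 30, 40, 50, 60, 70, 80, 10000]
println(symmetric_trimmean(x, 0.25))  # removes 2 values from each end
println(symmetric_trimmean(x, 0.05))  # floor(0.5) = 0, so nothing is removed
```

With this sketch, `p = 0.25` averages 20 through 70 and `p = 0.05` averages all ten values, which matches the R results of 45 and 1036.1 shown above.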


I also made a PR that makes trimmean() behave the same way as the R implementation, and the way Wilcox (2010) says it should behave. I don't have a lot of experience making PRs. I hope I got it right:




Cheers,
Daniel.
