I opened https://github.com/JuliaStats/DataFrames.jl/issues/1012 because the ModelFrame/ModelMatrix code badly needs refactoring. It was written early in the development of Julia when I was still thinking R while writing Julia.
Part of the motivation is to fix problems with ModelMatrix in Julia v0.5.0-dev using the new DataFrames formulation based on NullableArrays and CategoricalArrays. One issue I encountered is expanding columns from terms involving a CategoricalArray or a PooledDataArray. If you have a NominalArray with k levels in a model with an intercept, that term should generate k - 1 columns in the model matrix.

In R the reduced set of columns is called the contrasts. Some will argue with that name (technically, contrast columns are defined as being orthogonal to a constant column, but that is no longer important). One way of generating contrasts is first to generate the matrix of indicators and then generate the desired contrasts from it. Sometimes it is simpler to generate the contrasts matrix directly. Contrasts can be defined by a k by k - 1 matrix. The default in R for nominal arrays is the "treatment contrasts". The matrix defining these is obtained by dropping the first column of an identity matrix of size k. To reproduce the parameterization used in SAS, the last column of the identity is dropped instead. For ordinal arrays, polynomial contrasts are sometimes used.

Currently there is an indicatormat generic in StatsBase that creates a Matrix{Bool}, either sparse or dense, that is the transpose of the matrix of indicator columns. That is, it is the Matrix{Bool} of indicator rows, not columns.
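To make the "indicators first, then contrasts" route concrete, here is a minimal sketch in current Julia syntax (1.x, not the v0.5 syntax of the time). The helper indicator_rows is a hypothetical stand-in written for this illustration; it is not the StatsBase indicatormat implementation, it only mimics the k by n "indicator rows" layout described above.

```julia
using LinearAlgebra  # provides the identity object I

# Hypothetical helper (not StatsBase.indicatormat): returns a k-by-n
# Matrix{Bool} whose row i marks the positions where x equals levels[i].
function indicator_rows(x::AbstractVector, levels::AbstractVector)
    return [x[j] == levels[i] for i in eachindex(levels), j in eachindex(x)]
end

x = ["a", "b", "c", "a", "b"]
levels = ["a", "b", "c"]
k = length(levels)

rows = indicator_rows(x, levels)   # k-by-n: indicator rows
cols = permutedims(rows)           # n-by-k: indicator columns

# Treatment contrasts: identity of size k with its first column dropped.
T = Matrix{Float64}(I, k, k)[:, 2:end]

# n-by-(k - 1) block of model-matrix columns for this term.
X = cols * T
```

With the first level "a" as the reference, rows of X for "a" are all zero, which is why the term contributes only k - 1 columns alongside an intercept.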
I suggest that indicatormat methods be defined for PooledDataArray and CategoricalArray types too. Regarding contrasts, I think the contrasts generic should also be defined in StatsBase. Methods would be defined in packages like DataArrays and CategoricalArrays because they depend on the internal representation of the array type. The primary method would be like
Contrast types can be expressed as functions that map k to a k - 1 x k matrix.
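As a rough sketch of that idea, the two parameterizations mentioned earlier can be written as functions of k. The names treatment, sas and contrast_columns are hypothetical, chosen for this illustration and not taken from StatsBase or DataFrames; the sketch also assumes the k by k - 1 orientation and that the array's levels are available as integer codes in 1:k.

```julia
using LinearAlgebra  # provides the identity object I

# Assumed orientation: each function maps k to a k-by-(k - 1) matrix.
treatment(k::Integer) = Matrix{Float64}(I, k, k)[:, 2:end]    # drop first column (R default)
sas(k::Integer)       = Matrix{Float64}(I, k, k)[:, 1:end-1]  # drop last column (SAS)

# Hypothetical helper: build the model-matrix columns directly by picking,
# for each observation, the contrast-matrix row of its integer level code.
function contrast_columns(codes::AbstractVector{<:Integer}, k::Integer,
                          contrast = treatment)
    return contrast(k)[codes, :]
end

codes = [1, 2, 3, 1, 2]                    # level codes for 5 observations
X_trt = contrast_columns(codes, 3)         # reference level is the first
X_sas = contrast_columns(codes, 3, sas)    # reference level is the last
```

Indexing rows by level code skips the intermediate indicator matrix, which is one way of generating the contrast columns directly.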
You received this message because you are subscribed to the Google Groups "julia-stats" group. To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email]. For more options, visit https://groups.google.com/d/optout. |
Have you looked at https://github.com/JuliaStats/DataFrames.jl/pull/870? Many (but not all, I think) of these ideas are incorporated there.
-- On Wednesday, July 13, 2016 at 12:48:55 PM UTC-4, Douglas Bates wrote: