Analysis of four-dimensional datasets

5 messages
Analysis of four-dimensional datasets

Julia Stats mailing list
Hi all,

I started working with Julia at the beginning of my physics PhD and have written quite a bit of code for my research. However, not being formally trained in programming, I am now reaching some limits, both performance-wise and in my knowledge of existing algorithms. So I am here to ask for (general) help.

  1. I work in physics, on scattering techniques, so I deal a lot with four-dimensional datasets (reciprocal space plus energy, (H,K,L,E)), which tend to be huge (several gigabytes).
    I found it quite easy to work with DataFrames here.
  2. From a 4D dataset, I need to reduce dimensions.
    Example: say H and E get binned, while K and L are integrated; the result is normalized and the statistical error is calculated. The result would be something like a 2D dataset with axes along H and E: H = [-2, 0.1, 2]; -0.1 < K < 0.1; 1 < L < 4; E = [0, 1, 45].
    I need to find the optimal region to present the data, and searching a 4D dataset is hugely inconvenient. So far I have used DataFrames' join() on several fields, which creates a small set of duplicate/similar data points that are then combined. This works on small datasets, but since each point is compared against a huge set of points, I can't rely on this technique for sets of more than a million points. :)
    I guess clustering is the best way to go; are there any algorithms that come to mind? I googled and found the canopy clustering algorithm by McCallum et al. Considering that it creates overlapping canopies, it should be a good starting point. However, it has not been implemented in Julia, right? (A rough sketch follows after this list.)
  3. Last but not least, I need to fit my 1D data to custom models using chi-squared minimization.
    I have had huge problems fitting arbitrary curves (background, several normal distributions, linear offset) to my dataset. The fit should minimize chi-squared, weighted by the statistical error. Is that already implemented in any package?
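For reference, a minimal sketch of canopy clustering written directly from the paper's description; T1 > T2 are the loose and tight distance thresholds, and plain Euclidean distance stands in here for the paper's "cheap" metric. This is only an illustration, not a tuned implementation:

    # Sketch of canopy clustering (McCallum et al.): repeatedly pick a
    # center, collect everything within T1 into a canopy (canopies may
    # overlap), and retire points within T2 from being future centers.
    function canopies(X::Matrix{Float64}, T1::Float64, T2::Float64)
        # X holds one point per column (here 4 rows: H, K, L, E)
        remaining = collect(1:size(X, 2))
        result = Vector{Vector{Int}}()
        while !isempty(remaining)
            center = X[:, remaining[1]]      # arbitrary next center
            canopy = Int[]
            keep   = Int[]
            for i in remaining
                d = sqrt(sum(abs2, X[:, i] - center))
                d <= T1 && push!(canopy, i)  # inside the loose threshold
                d >  T2 && push!(keep, i)    # stays available as a center
            end
            push!(result, canopy)
            remaining = keep
        end
        return result                        # overlapping canopies, as column indices
    end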
Again -- for each of these tasks I have written some routines, but I have the feeling there is much faster and more elegant code out there. It would be a huge help if you dropped me a line with your thoughts on these points.


Looking forward to your input!

Re: Analysis of four-dimensional datasets

Cedric St-Jean-2
Hi, that sounds like interesting work! Some comments:

1. How are you storing a 4D dataset in a dataframe? One row per data point? Could you store it in a 4D array instead? If dataframes work, that's great, but they are not optimized for every kind of use. You might get much better performance writing the join explicitly using another structure (see the sketch after this list).

2. Do you have public code? In particular, have you profiled your code for a bottleneck? If you post some fragment that is slow, it'll be easier to provide concrete advice.

3. Re. Chi^2, I'm not familiar with that algorithm, but FYI there seems to be an R implementation, and you can call it with RCall.

4. If you want advice about choice of algorithm, you might have better luck asking in a forum specialized for your field. Also, consider posting on julia-users, it gets more traffic.
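To make point 1 concrete, a minimal sketch of the 4D-array layout, assuming the points sit on a regular (H, K, L, E) grid; all axis ranges are invented, and the syntax is Julia-0.4-era (on current Julia, find and sum(A, (2, 3)) are spelled findall and sum(A; dims = (2, 3))):

    # Invented axis ranges for a regular reciprocal-space + energy grid.
    Haxis = -2.0:0.1:2.0
    Kaxis = -0.5:0.05:0.5
    Laxis =  1.0:0.1:4.0
    Eaxis =  0.0:1.0:45.0

    # One intensity value per grid node.
    I = zeros(length(Haxis), length(Kaxis), length(Laxis), length(Eaxis))

    # An (H, E) slice, integrating K over (-0.1, 0.1) and L over (1, 4),
    # is then just indexing plus a sum over dimensions 2 and 3.
    kidx = find(k -> -0.1 < k < 0.1, collect(Kaxis))
    lidx = find(l ->  1.0 < l < 4.0, collect(Laxis))
    slice = reshape(sum(I[:, kidx, lidx, :], (2, 3)),
                    length(Haxis), length(Eaxis))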

Best,

Cédric

Re: Analysis of four-dimensional datasets

Julia Stats mailing list
Hi Cedric, thank you for your input!

On Saturday, April 23, 2016 at 3:22:40 AM UTC+2, Cedric St-Jean wrote:
Hi, that sounds like interesting work! Some comments:
1. How are you storing a 4D dataset in a dataframe? One row per data-point? Could you store it in a 4D array instead? If dataframes work, that's great, but they are not optimized for every kind of use. You might get much better performance writing the join explicitly using another structure.
 
Well, I was not clear enough: the coordinates of each point are four-dimensional, and then there are several fields for each point: Intensity, Error, Environment, ... Currently everything is stored as one point = one row in the DataFrame.
 

2. Do you have public code? In particular, have you profiled your code for a bottleneck? If you post some fragment that is slow, it'll be easier to provide concrete advice.
 
I will get back with more concrete data, but I guess part of it is the choice of commands.
It should be these two snippets. The first selects the needed points from the whole set of data points and puts them into a new DataFrame, workdf:

    # Row-by-row scan: keep every point whose (QH, QK, QL, EN) coordinates
    # fall inside the requested ranges.
    for point in 1:size(inputDataFrame, 1)
        if Hrange[1] <= inputDataFrame[point, :QH] <= Hrange[2] &&
           Krange[1] <= inputDataFrame[point, :QK] <= Krange[2] &&
           Lrange[1] <= inputDataFrame[point, :QL] <= Lrange[2] &&
           Erange[1] <= inputDataFrame[point, :EN] <= Erange[2]
            append!(workdf, inputDataFrame[point, virginColumns])
        end
    end

Depending on the slice I cut, I then combine the statistics. To get the wanted slice (say, 2D) I define the axis columns and then perform a join(). Then I just need to normalize and calculate the errors, but that is fast.

    point_current = datadf[1, :]
    # Columns that need to match; these give the axes of the slice.
    # (The coordinate columns are named :QH and :EN, as in the loop above.)
    criteria = [:QH, :EN]
    combinedf = join(point_current[1, criteria], workdf, on = criteria)

Currently I have a data file of 1.8 GB, amounting to several tens of millions of points. I hope this clarifies why I want to move away from the current approach; I think the direction to move in is clustering the data and then selecting clusters instead of searching through all data points.
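For reference, the same selection can also be written without the explicit row loop; a sketch, assuming the column names above and the elementwise & of the DataFrames/Julia of that time (spelled .& on current Julia):

    # Sketch: one vectorized boolean mask instead of the row-by-row loop.
    mask = (Hrange[1] .<= inputDataFrame[:QH]) & (inputDataFrame[:QH] .<= Hrange[2]) &
           (Krange[1] .<= inputDataFrame[:QK]) & (inputDataFrame[:QK] .<= Krange[2]) &
           (Lrange[1] .<= inputDataFrame[:QL]) & (inputDataFrame[:QL] .<= Lrange[2]) &
           (Erange[1] .<= inputDataFrame[:EN]) & (inputDataFrame[:EN] .<= Erange[2])
    workdf = inputDataFrame[mask, virginColumns]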


3. Re. Chi^2, I'm not familiar with that algorithm, but FYI there seems to be an R implementation, and you can call it with RCall.

Thanks, I will search for that one. :)
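As an aside: a chi-squared fit is weighted least squares, so a pure-Julia route also exists. Below is a minimal sketch using the LsqFit.jl package, in current broadcast syntax; the model, data, and parameter values are invented, and the exact weight convention should be checked against the LsqFit documentation:

    # Sketch: weighted least-squares (chi^2) fit of an invented model,
    # one Gaussian peak on a linear background, with LsqFit.jl.
    using LsqFit

    # p = [amplitude, center, width, slope, offset]
    model(x, p) = p[1] .* exp.(-0.5 .* ((x .- p[2]) ./ p[3]).^2) .+ p[4] .* x .+ p[5]

    x = collect(0.0:0.1:10.0)
    y = model(x, [5.0, 5.0, 0.5, 0.1, 1.0]) .+ 0.1 .* randn(length(x))
    sigma = fill(0.1, length(x))     # statistical error of each point
    wt = 1 ./ sigma.^2               # chi^2 weights (inverse variance)

    p0  = [4.0, 4.5, 1.0, 0.0, 0.5]  # initial parameter guess
    fit = curve_fit(model, x, y, wt, p0)
    println(fit.param)               # best-fit parameters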

4. If you want advice about choice of algorithm, you might have better luck asking in a forum specialized for your field. Also, consider posting on julia-users, it gets more traffic.

Thanks for the tip, will do shortly.

Best, 0kto

Re: Analysis of four-dimensional datasets

Julia Stats mailing list
In reply to this post by Cedric St-Jean-2
Hi,

I just recoded quite a bit, and after a short excursion into the world of multidimensional arrays I stayed with the DataFrame setup. With the DataFramesMeta package, speed improves quite a bit; the bottleneck is now getting the data into a DataFrame (reading ASCII from disk is slow).

Getting the interesting region from the six-column / four-axis DataFrame works like
    df = @where(df, :QH .> H[1], :QH .< H[2])
and runs almost instantly on this 3-million-row DataFrame; the integration then works fast on the subsets. This way the time delay from raw data to plot is almost gone. :)
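For completeness, a sketch of what the full four-axis selection plus a binning/integration step can look like, assuming the circa-2016 DataFrames/DataFramesMeta API (column assignment and @where are spelled differently on current versions); the column names :Intensity and :Error, the bin width of 0.1, and the in-quadrature error propagation are invented for illustration:

    using DataFrames, DataFramesMeta

    # Restrict all four axes at once.
    df = @where(df, :QH .> H[1], :QH .< H[2],
                    :QK .> K[1], :QK .< K[2],
                    :QL .> L[1], :QL .< L[2],
                    :EN .> E[1], :EN .< E[2])

    # Bin along H: snap each QH to a bin center, then aggregate the
    # intensity per bin and propagate the statistical error in quadrature.
    df[:Hbin] = round(df[:QH] / 0.1) * 0.1
    slice = by(df, :Hbin) do sub
        DataFrame(I   = sum(sub[:Intensity]),
                  Err = sqrt(sum(sub[:Error].^2)))
    end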

So thanks again for your hints; they gave me some good ideas. I guess that solves most of my questions!

Re: Analysis of four-dimensional datasets

Cedric St-Jean-2
Cheers! If your data is in CSV, you can convert it to a better format to solve your loading-time issue. Or you can use JLD.save() / JLD.write() to do the same thing. That's what I usually do.
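A minimal sketch of that round-trip with JLD; the file and variable names are arbitrary:

    using JLD, DataFrames

    # One-time conversion, after the slow ASCII parse:
    save("dataset.jld", "df", df)

    # Every later run loads the binary copy instead:
    df = load("dataset.jld", "df")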
