Call Julia from PySpark

Call Julia from PySpark

Harish Kumar
I have an RDD with 10K columns and 70 million rows; the 70 MM rows will be grouped into 2,000-3,000 groups based on a key attribute. I followed the steps below:

1. Julia and PySpark are linked using the pyjulia package (a minimal smoke test of that link follows the code below).
2. The 70 MM-row RDD is grouped with groupByKey and each group is handed to Julia:

    import julia

    def juliaCall(x):
        # convert x (an iterable of rows) to a list of lists, inputdata
        inputdata = [list(row) for row in x]
        j = julia.Julia()      # starts a Julia runtime for every group
        jcode = """     """    # the actual Julia code is elided here
        calc = j.eval(jcode)
        result = calc(inputdata)
        return result

    # groupBy yields (key, rows) pairs, so only the rows go to juliaCall
    RDD.groupBy(key).map(lambda kv: juliaCall(kv[1]))
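
For step 1, a minimal smoke test of the pyjulia link (a sketch, assuming pyjulia is installed and a julia binary is on the PATH; the squaring function is just an illustration, not from the workload above):

    import julia

    j = julia.Julia()
    square = j.eval("x -> x .^ 2")  # compile a Julia anonymous function
    print(square([1.0, 2.0, 3.0]))  # expected: [1.0, 4.0, 9.0]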

It works fine for a key (group) with 50K records, but each of my groups has 100K to 3M records. In such cases the shuffle grows and the job fails. Can anyone guide me on how to overcome this issue?
I have a cluster of 10 nodes; each node has 116 GB of memory and 16 cores. It runs in standalone mode, and I allocated only 10 cores per node.
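
One possible direction (a rough sketch, not from the original thread): avoid groupByKey's full group materialization and the per-group Julia start-up cost by repartitioning so that rows with the same key land contiguously, then running one Julia runtime per partition with mapPartitions. Here pair_rdd, NUM_PARTITIONS, and julia_partition are illustrative names, and the Julia one-liner is a stand-in for the elided code:

    from itertools import groupby
    from operator import itemgetter

    NUM_PARTITIONS = 2500  # assumption: roughly one partition per key group

    def julia_partition(pairs):
        import julia
        j = julia.Julia()  # one Julia runtime per partition, not per group
        # stand-in one-liner; the real Julia code was elided in the post
        calc = j.eval("""data -> length(data)""")
        # after repartitionAndSortWithinPartitions, rows with the same key
        # arrive contiguously, so each group can be streamed one at a time
        # instead of materializing whole groups the way groupByKey does
        for key, group in groupby(pairs, key=itemgetter(0)):
            inputdata = [list(kv[1]) for kv in group]
            yield key, calc(inputdata)

    # pair_rdd: an RDD of (key, row) tuples built from the original data
    results = (pair_rdd
               .repartitionAndSortWithinPartitions(numPartitions=NUM_PARTITIONS)
               .mapPartitions(julia_partition))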

Any help?

Re: Call Julia from PySpark

Páll Haraldsson
I'm not sure this is helpful, but could you use one of these?:

https://github.com/dfdx/Spark.jl

css.csail.mit.edu/6.824/2014/projects/dennisw.pdf

I'm not familiar with PySpark or the above, so I'm not sure what the scalability problem is or whether this helps.


On Thursday, November 3, 2016 at 1:45:31 AM UTC, Harish Kumar wrote: