Calling Julia from PySpark


Calling Julia from PySpark

Harish Kumar
I have an RDD with 10K columns and 70 million rows; the 70 MM rows will be grouped into 2,000-3,000 groups based on a key attribute. I followed the steps below:

1. Julia and PySpark are linked using the pyjulia package (a minimal round trip is sketched after these steps).
2. The 70 MM-row RDD is grouped by key, and each group is handed to a Julia function:

    import julia

    def juliaCall(x):
        # <<convert x (a list of rows) to a list of lists, inputdata>>
        j = julia.Julia()
        jcode = """     """        # Julia code omitted in the original post
        calc = j.eval(jcode)
        result = calc(inputdata)
        return result

    RDD.groupBy(key).map(lambda x: juliaCall(x))
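
For reference, a minimal round trip through pyjulia looks roughly like the sketch below. It assumes pyjulia is installed and a Julia runtime is on the PATH; the Julia snippet (a per-row sum) is a hypothetical stand-in for the jcode omitted above.

    import julia

    # Start an embedded Julia interpreter (one per Python process).
    j = julia.Julia()

    # Evaluate an anonymous Julia function; j.eval returns a callable usable from Python.
    # The function body here is an illustrative assumption, not the poster's actual code.
    rowsum = j.eval("data -> [sum(row) for row in data]")

    inputdata = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # list of lists, as built in juliaCall
    print(rowsum(inputdata))                          # expected: [6.0, 15.0]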

It works fine for keys (groups) with about 50K records, but each of my groups has 100K to 3M records; in those cases the shuffle is much larger and the job fails. Can anyone guide me on how to work around this issue?
I have a cluster of 10 nodes, each with 116 GB of RAM and 16 cores, running in standalone mode; I allocated only 10 cores per node.
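
For what it's worth, the allocation described above corresponds roughly to standalone-mode settings like the following (a sketch only, assuming the application is configured through SparkConf; the master URL and executor memory figure are illustrative assumptions, not taken from the post):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("spark://master:7077")       # hypothetical master URL
            .set("spark.executor.cores", "10")      # 10 of the 16 cores per node, as described
            .set("spark.executor.memory", "100g")   # assumed; leaves headroom on a 116 GB node
            .set("spark.cores.max", "100"))         # 10 nodes x 10 cores
    sc = SparkContext(conf=conf)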

Any help?

Re: Calling Julia from PySpark

Stefan Karpinski
This question is better suited to the julia-users mailing list.

On Wed, Nov 2, 2016 at 8:00 PM, Harish Kumar <[hidden email]> wrote: