Machine learning


Machine learning

Sreenivas Raghavan
Is it worth considering the idea of implementing Apache Mahout's MapReduce capabilities and Spark's MLlib functions in Julia?

Re: Machine learning

waTeim


On Tuesday, June 16, 2015 at 11:49:36 AM UTC-4, Sreenivas Raghavan wrote:
Is it worth considering the idea of implementing Apache Mahout's MapReduce capabilities and Spark's MLlib functions in Julia?

Did someone say Spark? 

Re: Machine learning

Steven Sagaert
In reply to this post by Sreenivas Raghavan
Nope. What would make sense is to create a Julia interface to Spark.

On Tuesday, June 16, 2015 at 5:49:36 PM UTC+2, Sreenivas Raghavan wrote:
Is it worth considering the idea of implementing Apache Mahout's MapReduce capabilities and Spark's MLlib functions in Julia?

Re: Machine learning

Randy Zwitch
If someone more conceptually sound in computer science could start the Spark project, I think there would be a lot of contributors (such as myself).

On Thursday, June 18, 2015 at 7:38:32 AM UTC-4, Steven Sagaert wrote:
Nope. What would make sense is to create a Julia interface to Spark.


Re: Machine learning

Andrei Zh
There's already a project for integrating Julia and Spark [1]. It has some conceptual shortcomings, though, which I described in the project's Issues section, so I'm also reconsidering the architecture in my own project [2] (which is almost empty for now, but I'm going to come back to it soon). If somebody is interested, I can describe the needed steps and expected results.

[1]: https://github.com/jey/Spock.jl
[2]: https://github.com/dfdx/Sparta.jl


On Thursday, June 18, 2015 at 5:58:44 PM UTC+3, Randy Zwitch wrote:
If someone more conceptually sound in computer science could start the Spark project, I think there would be a lot of contributors (such as myself).


Re: Machine learning

waTeim
What stops someone from looking at the PySpark implementation ....

 ... and just simply port PySpark to Julia?  Just to get something
usable.  How hard can it be?

Re: Machine learning

Andrei Zh
@Jeff: both PySpark and SparkR are pretty simple at their core and can easily be translated to Julia. There are many hidden issues, however. One of the most annoying, for example, is a segfault on every second exception in JavaCall (see, for example, #6 and #8). Another unexpected thing I encountered is that JavaCall crashes when using Scala classes, so only Java wrappers can be used from Julia. But other than that, it is just a question of free time/hands.



On Thu, Jun 18, 2015 at 10:34 PM, Jeff Waller <[hidden email]> wrote:
What stops someone from looking at the PySpark implementation ....

 ... and just simply port PySpark to Julia?  Just to get something
usable.  How hard can it be?


Re: Machine learning

Avik Sengupta
I should have a fix for that JavaCall  segfault on exception soon (..ish)

Regards
-
Avik

On Thursday, 18 June 2015 22:54:22 UTC+1, Andrei Zh wrote:
@Jeff: both PySpark and SparkR are pretty simple at their core and can easily be translated to Julia. There are many hidden issues, however. One of the most annoying, for example, is a segfault on every second exception in JavaCall (see, for example, #6 (https://github.com/aviks/JavaCall.jl/issues/6) and #8 (https://github.com/aviks/JavaCall.jl/issues/8)). Another unexpected thing I encountered is that JavaCall crashes when using Scala classes, so only Java wrappers can be used from Julia. But other than that, it is just a question of free time/hands.




Re: Machine learning

Jonathan Malmaud
Hi all,
I am about to start working on the Julia machine learning ecosystem full time as my PhD thesis topic, directly alongside the Julia team at MIT. A satisfactory solution for large scale distributed computation is at the top of my agenda. I'm looking forward to working with you all to evolve Julia into a world-class platform for modern machine learning.

Re: Machine learning

Elliot Saba
That's great news, Jonathan!  I look forward to seeing your work in this area!
-E

On Thu, Jun 18, 2015 at 5:43 PM, Jonathan Malmaud <[hidden email]> wrote:
Hi all,
I am about to start working on the Julia machine learning ecosystem full time as my PhD thesis topic, directly alongside the Julia team at MIT. A satisfactory solution for large scale distributed computation is at the top of my agenda. I'm looking forward to working with you all to evolve Julia into a world-class platform for modern machine learning.


Re: Machine learning

John Myles White
Likewise. Very excited to see what you'll do, Jonathan.

 -- John

On Jun 18, 2015, at 8:46 PM, Elliot Saba <[hidden email]> wrote:

That's great news, Jonathan!  I look forward to seeing your work in this area!
-E




Re: Machine learning

waTeim
In reply to this post by Andrei Zh


On Thursday, June 18, 2015 at 5:54:22 PM UTC-4, Andrei Zh wrote:
@Jeff: both PySpark and SparkR are pretty simple at their core and can easily be translated to Julia. There are many hidden issues, however. One of the most annoying, for example, is a segfault on every second exception in JavaCall (see, for example, #6 (https://github.com/aviks/JavaCall.jl/issues/6) and #8 (https://github.com/aviks/JavaCall.jl/issues/8)). Another unexpected thing I encountered is that JavaCall crashes when using Scala classes, so only Java wrappers can be used from Julia. But other than that, it is just a question of free time/hands.


signal (11): Segmentation fault
jl_f_get_field at /usr/bin/../lib/x86_64-linux-gnu/julia/libjulia.so (unknown line)

Yeah, that sounds annoying and hard to deal with when debugging. But in #6, at least, the stack trace seems to point at libjulia rather than Java (after looking at it for 1 minute). You say that Java 1.8 doesn't work with Julia? Not too surprising; it didn't work with Spark either, last time I looked (1.5 months ago). Aug 2014? Man, it's been a while. Is it still applicable? If it's Java, can the problem be avoided by not using Oracle Java, just to test at first at least?



>Hi all, 
>I am about to start working on the Julia machine learning ecosystem full time as my PhD thesis topic, directly alongside the Julia team at MIT. A satisfactory solution for >large scale distributed computation is at the top of my agenda. I'm looking forward to working with you all to evolve Julia into a world-class platform for modern machine >learning.

That guy sounds motivated!


Re: Machine learning

Sreenivas Raghavan
In reply to this post by Sreenivas Raghavan
Hi Jonathan,
I am interested in working alongside you on scalable machine learning. It would help if you made a roadmap for the entire development process; I am willing to work according to that.

On Tuesday, June 16, 2015 at 9:19:36 PM UTC+5:30, Sreenivas Raghavan wrote:
Is it worth considering the idea of implementing Apache Mahout's MapReduce capabilities and Spark's MLlib functions in Julia?

Re: Machine learning

Andrei Zh
In reply to this post by Avik Sengupta
Great to hear and thanks for your incredible work! 

On Friday, June 19, 2015 at 2:04:49 AM UTC+3, Avik Sengupta wrote:
I should have a fix for that JavaCall  segfault on exception soon (..ish)

Regards
-
Avik



Re: Machine learning

Andrei Zh
In reply to this post by waTeim


You say that Java 1.8 doesn't work with Julia?  Not too surprising, it doesn't work with Spark either last time I looked (1.5 months ago).  

It's even deeper: Scala 2.10 doesn't support Java 8, and Scala 2.11 has known issues (something with string representation, IIRC).

 
If it's Java, can the problem be avoided by not using Oracle Java, just to test at first at least?

I tried OpenJDK first, and it was even worse.  



Re: Machine learning

wildart
For Spark integration, a Java-Julia (de)serializer is required, like it's done for Python and R. The Python frontend is based on Py4J, and a similar RPC mechanism is used for SparkR. Because Spark is a bunch of transformations (read: functions), these transformations need to be mapped to function calls in the frontend language. IMHO, a Spark-Julia binding deserves a try, but I'm more inclined toward a pure Julia implementation of Spark's transformations. A native Julia Spark package on top of Elly.jl would beat any JVM-based implementation in performance and resource usage.
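The (de)serializer role can be sketched roughly like this: records crossing the JVM/frontend boundary get encoded into a length-prefixed byte stream and decoded on the other side. This is a toy illustration in Python, not Spark's actual format; all names here are made up.

```python
# Toy record codec: each record is pickled and written with a 4-byte length
# prefix, so the reader knows exactly how many bytes belong to each record.
import io
import pickle
import struct

def dump_records(records, stream):
    for r in records:
        payload = pickle.dumps(r)
        stream.write(struct.pack(">I", len(payload)))  # big-endian 4-byte length
        stream.write(payload)

def load_records(stream):
    out = []
    while True:
        header = stream.read(4)
        if not header:           # EOF: no more framed records
            break
        (length,) = struct.unpack(">I", header)
        out.append(pickle.loads(stream.read(length)))
    return out

buf = io.BytesIO()
dump_records([("a", 1), ("b", 2)], buf)
buf.seek(0)
decoded = load_records(buf)
print(decoded)  # [('a', 1), ('b', 2)]
```

A real binding would of course need a codec both sides agree on (Spark's Python path uses its own framing plus pickle); the point is only that the boundary is a plain byte stream.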



Re: Machine learning

waTeim


On Friday, June 19, 2015 at 1:26:55 PM UTC-4, [hidden email] wrote:
For Spark integration, a Java-Julia (de)serializer is required, like it's done for Python and R. The Python frontend is based on Py4J, and a similar RPC mechanism is used for SparkR. Because Spark is a bunch of transformations (read: functions), these transformations need to be mapped to function calls in the frontend language. IMHO, a Spark-Julia binding deserves a try, but I'm more inclined toward a pure Julia implementation of Spark's transformations. A native Julia Spark package on top of Elly.jl would beat any JVM-based implementation in performance and resource usage.

I believe that getting something that actually runs will inspire people to try it and make it better, and it (part 1) can be completed
before the end of the summer.  

Let me see if I can understand this problem. Are you saying it's this, for example:

PySpark

someRDD.reduceByKey(lambda x,y: x+y)    ---> mapped to Java via Py4j

JuliaSpark

reduceByKey(someRDD,(x,y)->x+y) ---->  mapped to Java via X   <--- what does this need to be

Is the tricky part coming up with a reasonable X?



Re: Machine learning

Andrei Zh
I feel like it's worth describing how PySpark is implemented and what is needed to connect Julia to Spark in the same manner.

In Spark, the central concept is the RDD - a distributed collection of data partitions (splits). There are many different types of RDDs, such as MapPartitionsRDD, ShuffledRDD, CheckpointRDD, etc. Each type of RDD introduces a conceptually new feature, e.g. MapPartitionsRDD is used to implement `map()`, `flatMap()`, `mapPartitions()` and similar methods, while ShuffledRDD is responsible for shuffling data between machines, etc.

To add a new feature, every type of RDD should implement at least one method - `compute(split: Partition, context: TaskContext): Iterator[T]`. Essentially, `compute()` takes an input partition's data iterator and returns an output data iterator. This is very similar to how `mapPartitions()` works - it simply applies some arbitrary transformation to every partition (and this is essentially what MapPartitionsRDD's `compute()` method does [1]). Some RDDs also involve data shuffling and working with external resources, but they aren't that important for this discussion.
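In Python terms, the `compute()` contract above could be sketched like this (toy class names, not Spark's API): each RDD turns an input partition iterator into an output iterator, and a MapPartitions-style RDD just wraps its parent's iterator with an arbitrary function.

```python
# Toy illustration of the per-RDD compute() contract described above.
class ToyRDD:
    def __init__(self, partitions):
        self.partitions = partitions          # list of lists of records

    def compute(self, split):
        return iter(self.partitions[split])   # base case: yield the raw records

class MapPartitionsToyRDD(ToyRDD):
    def __init__(self, parent, f):
        self.parent = parent
        self.f = f                            # f: iterator -> iterator

    def compute(self, split):
        # Apply an arbitrary whole-partition transformation to the parent's iterator.
        return self.f(self.parent.compute(split))

rdd = ToyRDD([[1, 2], [3, 4]])
doubled = MapPartitionsToyRDD(rdd, lambda it: (x * 2 for x in it))
print([list(doubled.compute(i)) for i in range(2)])  # [[2, 4], [6, 8]]
```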

This is where PythonRDD comes in. In its `compute()` method, PythonRDD:

1) creates or reuses a Python process [2]
2) writes the serialized command and input data to the Python process (in a separate thread) [3]
3) reads results from the Python process [4]

So Scala talks to the Python process over a socket interface, using a simple custom protocol.
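A toy version of that driver/worker exchange, assuming nothing about Spark's real wire format (the length-prefixed pickle framing and the command names here are invented for illustration):

```python
# Driver writes a serialized command plus a data partition to the worker's
# socket; the worker applies the function to each record and writes the
# results back. A thread plays the worker-process role.
import pickle
import socket
import struct
import threading

def send_msg(sock, obj):
    payload = pickle.dumps(obj)
    sock.sendall(struct.pack(">I", len(payload)) + payload)  # 4-byte length prefix

def recv_exact(sock, n):
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise EOFError("socket closed before message was complete")
        data += chunk
    return data

def recv_msg(sock):
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return pickle.loads(recv_exact(sock, length))

def worker(sock):
    # Receive a command name and a partition, apply the function per record,
    # send the transformed partition back.
    func_name, partition = recv_msg(sock)
    func = {"square": lambda x: x ** 2}[func_name]  # commands by name; lambdas don't pickle
    send_msg(sock, [func(x) for x in partition])
    sock.close()

driver_sock, worker_sock = socket.socketpair()
t = threading.Thread(target=worker, args=(worker_sock,))
t.start()
send_msg(driver_sock, ("square", [1, 2, 3]))  # the compute() side writes command + data...
result = recv_msg(driver_sock)                # ...and reads the results back
t.join()
driver_sock.close()
print(result)  # [1, 4, 9]
```

A JuliaRDD would need the same two halves: a Julia worker loop speaking this kind of framed protocol, and a JVM side that writes the command and partition to it.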

But essentially we want Python to talk to the JVM, not vice versa. This is where Py4J is useful. The PySpark driver creates a JVM and uses it to maintain all needed objects (mainly SparkContext and RDD). Python's RDD (i.e. `class RDD` in `rdd.py`) keeps a reference to the corresponding JVM object (`PythonRDD` in `PythonRDD.scala`) and calls its methods. When we write something like this in PySpark:

    rdd = sc.textFile(...)
    rdd.map(lambda x: x**2)
       .collect()

what happens is actually this:

1) Python's RDD is created, pointing to the PythonRDD object in the JVM
2) a subclass of Python's RDD - PipelinedRDD - is created; PipelinedRDD keeps a reference to the previous RDD and the function `f = lambda x: x**2` to be applied to each record of the original RDD
3) `PipelinedRDD.collect()` leads to calling the corresponding method in PythonRDD, then to passing all data through sockets to the Python processes on the workers, collecting the results, and passing them back to the Python process on the driver machine.

PipelinedRDD is so called because it can pipeline Python functions in `map()` and `reduce()` operations without the need to go back to the JVM, but this is mostly an optimization.
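The pipelining trick can be sketched in a few lines (class names are illustrative, not PySpark's actual implementation): chained `map()` calls compose into a single function, so the fused pipeline runs in one pass over the data instead of one round trip per stage.

```python
# Toy pipelined RDD: successive map() stages are fused by function composition.
class DemoRDD:
    def __init__(self, data):
        self.data = data

    def map(self, f):
        return DemoPipelinedRDD(self, f)

    def collect(self):
        return list(self.data)

class DemoPipelinedRDD(DemoRDD):
    def __init__(self, prev, f):
        if isinstance(prev, DemoPipelinedRDD):
            # Fuse with the previous stage: compose the functions rather than
            # materializing an intermediate result.
            g = prev.func
            self.func = lambda x: f(g(x))
        else:
            self.func = f
        self.data = prev.data

    def collect(self):
        # A single pass applies the whole fused pipeline to each record.
        return [self.func(x) for x in self.data]

rdd = DemoRDD([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 10)
print(rdd.collect())  # [20, 30, 40]
```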

Some points are simplified or may contain errors, but essentially this is more or less how it works. 

So what do we need to implement a Julia-Spark connector? Essentially, only 2 things - Julia-aware workers (JuliaRDD) and driver types/functions (RDD, PipelinedRDD) to call them directly from Julia. Not that much!



[1]: https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala#L34
[2]: https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L73
[3]: https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L208
[4]: https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L106





Re: Machine learning

Andrei Zh
In reply to this post by wildart

IMHO, a Spark-Julia binding deserves a try, but I'm more inclined toward a pure Julia implementation of Spark's transformations. A native Julia Spark package on top of Elly.jl would beat any JVM-based implementation in performance and resource usage.

Spark is more than just data transformations. It also has a remarkable distributed memory management system, great integration with HDFS, YARN and Mesos support, one of the fastest SQL engines, easy access to all Hadoop data formats, and tons of already-written code, including Streaming, GraphX and MLlib. I don't think it's possible to do these things better in Julia, at least not in reasonable time.

For me it sounds like there's only one place where we can do better with a pure Julia implementation - a distributed linear algebra and machine learning system for big data. PySpark uses NumPy for MLlib, while the Scala counterpart is based on the Breeze [1] library. Both the conversions between data representations and the passing of data between JVM and Python processes have their cost, sometimes larger than the computations themselves. So if we concentrated on this specific part, and had integration with Spark as a separate project, we could probably get the best of both worlds.

[1]: https://github.com/scalanlp/breeze

 

Re: Machine learning

wildart
I agree, the JVM world is way ahead in supporting various big data frameworks, so this opportunity should be exploited to provide a top experience in Julia. If we talk about tasks that operate on large amounts of data, where the JVM-Julia overhead can be disregarded, then integrating with Spark is a totally justifiable approach.

What I wanted to say is that a fault-tolerant infrastructure for Julia parallel computations is required, and RDD is a great framework that could lead this effort. I believe Julia could provide a better programming experience with its current parallel computational model, where code can easily be transferred to any computational node and executed there. The ultimate goal for big data processing is being able to scale your program to a data center without changing a lot of code (just add @parallel). Spark does this in a limited way with RDD transformations, trying to mitigate the problem of transferring, running and tracking code in a JVM environment in an integrated process. But the problem is still there: you need code running on all data nodes, coordinated execution (a la MPI), and everything made fault-tolerant. This is a hard problem to solve on the JVM, whereas in Julia it could be done faster and more easily.
