Possibility for an MPI-based cluster manager for use on Cray systems?

Joshua Job
Hello all,

I recently acquired an account under a project on ORNL's Titan supercomputer and had hoped to deploy some Julia codes that I wrote and have been running on my university's HPC cluster, but I'm having some trouble. Titan only allows processes to be started on other nodes via the "aprun" command, which is basically the same as mpirun. Processes can communicate, but only via MPI (no ssh-ing into compute nodes is allowed).

I know there are MPI.jl and ClusterManagers.jl packages available. Does anyone have any idea how much work would be involved in creating a cluster manager that passes messages between workers via MPI rather than ssh?

Alternatively, I primarily use pmap for parallel computation, so I may be able to get by with a wrapper script that first computes which core will do which task in the pmap-like operation and then writes a configuration file that all the Julia workers can read to find out which job to do, given their MPI rank from MPI.jl. That might work, but it isn't as clean as the real pmap.
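
A minimal sketch of that pmap-like idea, assuming a round-robin split and skipping the configuration file by deriving each worker's share directly from its rank. The helper names `tasks_for_rank` and `static_pmap` are hypothetical and not part of MPI.jl. Under MPI.jl the rank and process count would come from MPI.Comm_rank and MPI.Comm_size; here they are plain arguments so the sketch runs without MPI.

```julia
# Hypothetical helper: the 1-based task indices owned by one rank under a
# round-robin split. `rank` is 0-based, as MPI ranks are.
tasks_for_rank(ntasks::Integer, rank::Integer, nranks::Integer) =
    (rank + 1):nranks:ntasks

# Hypothetical pmap-like helper: apply `f` only to this rank's share of
# `inputs` and return (index, result) pairs so outputs can be merged later.
function static_pmap(f, inputs::AbstractVector, rank::Integer, nranks::Integer)
    return [(i, f(inputs[i])) for i in tasks_for_rank(length(inputs), rank, nranks)]
end

# Simulate four ranks in a single process to show who would compute what.
inputs = collect(1:10)
for rank in 0:3
    println("rank ", rank, " => ", static_pmap(x -> x^2, inputs, rank, 4))
end
```

A round-robin split balances well only when the tasks cost roughly the same; unlike the real pmap, nothing is rebalanced at run time.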

Thanks in advance for any guidance y'all might be able to give!
-Josh.

Re: Possibility for an MPI-based cluster manager for use on Cray systems?

Viral Shah
Do look at ClusterManagers.jl, which has support for Sun Grid Engine. It may be possible to add aprun support in much the same way, without too much effort.

MPI.jl is a package that lets processes in the cluster talk over MPI, instead of the default socket-based communication that Julia uses.

-viral

In reply to Joshua Job's post of Friday, August 15, 2014, 9:55:35 AM UTC+5:30.

Re: Possibility for an MPI-based cluster manager for use on Cray systems?

Amit Murthy
Connections to Julia workers are regular socket streams. ClusterManagers.jl has modules to launch Julia workers on supported grid engines. We still need regular socket connectivity to the launched workers, following which pmap / @parallel can be used.
ssh to launch workers is supported as part of base Julia, but, to reiterate, ssh is only used to launch the workers, not for subsequent connections.
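
As a rough illustration of that split between launching and communicating, the sketch below starts two workers on the local machine and then calls pmap over them. It assumes a current Julia release, where these functions live in the Distributed standard library (at the time of this thread they were part of Base). The function name `square` is just for illustration. A cluster manager would replace only the addprocs step.

```julia
using Distributed            # standard library in Julia 0.7 and later

addprocs(2)                  # launch two local workers; a cluster manager would launch them remotely
@everywhere square(x) = x^2  # define the function on the master and on every worker

results = pmap(square, 1:8)  # tasks are handed to the socket-connected workers
println(results)

rmprocs(workers())           # shut the workers down again
```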

Having Julia use MPI for its core messaging infrastructure would involve changes in base/multi.jl. While Julia's messaging protocol is pretty simple, it is not currently possible to drop in a replacement for it. I cannot comment on the feasibility or extent of the changes needed to make MPI the native messaging layer, as I am not really familiar with MPI.

I support your idea of adding a pmap-like call to MPI.jl.

BTW, I looked up http://en.wikipedia.org/wiki/Titan_(supercomputer) - nice!

  Amit  

In reply to Viral Shah's post of Friday, August 15, 2014, 10:05:54 AM UTC+5:30.

Re: Possibility for an MPI-based cluster manager for use on Cray systems?

Patrick Sanan
In reply to this post by Joshua Job
Hi Joshua - 

Did you continue working on this idea? I'd also like to experiment with using Julia in the Cray environment (using aprun, etc.).

Best,
Patrick


Re: Possibility for an MPI-based cluster manager for use on Cray systems?

Erik Schnetter
You don't need a cluster manager to use MPI with Julia. You can start the Julia processes in the usual MPI way via "aprun julia myproc". In the Julia code, you can then use MPI to determine each process's rank, etc.
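
A minimal sketch of such a script, assuming MPI.jl is installed and built against the system MPI. The file name myproc.jl is an assumption, and whatever process-count options the site requires would be added to the aprun line. Every process started by the launcher runs this same script and learns its own rank from MPI.

```julia
# Every process started by the MPI launcher executes this whole script.
using MPI

MPI.Init()
comm   = MPI.COMM_WORLD
rank   = MPI.Comm_rank(comm)   # 0-based id of this process
nranks = MPI.Comm_size(comm)   # total number of processes started

println("Hello from rank $rank of $nranks")

MPI.Barrier(comm)              # wait until every rank has reached this point
MPI.Finalize()
```

From here, each process can pick its share of the work based on `rank` and `nranks`, with no cluster manager involved.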

I have written a semi-usable set of communication primitives that work in this environment (@rexec, @par, etc.; names are different from standard Julia) <https://bitbucket.org/eschnett/funhpc.jl>, but I haven't measured or improved performance for this yet.

-erik

In reply to Patrick Sanan's post of Feb 5, 2015, 8:53.
--
Erik Schnetter <[hidden email]>
http://www.perimeterinstitute.ca/personal/eschnetter/

My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from https://sks-keyservers.net.

