Running julia with HTCondor

Roshan Chaudhari
I have one central manager and three other nodes. I installed HTCondor and Julia on all of these machines, and condor_status lists all the nodes. Now I am trying to use Julia with HTCondor, so I ran the commands below:

export HOSTNAME=`hostname`

julia> Pkg.add("ClusterManagers")

julia> using ClusterManagers

addprocs(3, cman=HTCManager())
Submitting job(s)...
Waiting for 3 workers:

But as you can see, the last command just waits for the workers. What am I missing here?



Thanks,
Roshan

Re: Running julia with HTCondor

Pere
I've run into the same problem. Did you manage to fix it?

On Thursday, 25 September 2014 20:32:15 UTC+2, Roshan Chaudhari wrote:
I have one central manager and three other nodes. I installed HTCondor and Julia on all of these machines, and condor_status lists all the nodes. Now I am trying to use Julia with HTCondor, so I ran the commands below:

export HOSTNAME=`hostname`

julia> Pkg.add("ClusterManagers")

julia> using ClusterManagers

addprocs(3, cman=HTCManager())
Submitting job(s)...
Waiting for 3 workers:

But as you can see, the last command just waits for the workers. What am I missing here?



Thanks,
Roshan

Re: Running julia with HTCondor

Angel de Vicente
Hi,

Pere <[hidden email]> writes:

> I've run into the same problem. Did you manage to fix it?
>
> On Thursday, 25 September 2014 20:32:15 UTC+2, Roshan Chaudhari wrote:
>     I have one central manager and three other nodes. I installed HTCondor and
>     Julia on all of these machines, and condor_status lists all the nodes. Now
>     I am trying to use Julia with HTCondor, so I ran the commands below:
>    
>     export HOSTNAME=`hostname`
>    
>     julia> Pkg.add("ClusterManagers")
>    
>     julia> using ClusterManagers
>    
>     addprocs(3, cman=HTCManager())
>     Submitting job(s)...
>     Waiting for 3 workers:
>    
>     But as you can see, the last command just waits for the workers. What am
>     I missing here?


I'm also trying to run Julia with Condor, and so far I have hit the same
issue as you. I'm not sure whether this applies to you, but in my case
the problem seems twofold:

1- I have julia installed in a directory that is not accessible to the
workers (for the moment, for testing, I just modified condor.jl by hand
so that the workers can get to those files, and this is no longer a
problem).

2- workers try to connect to the master via telnet (at port 8553), which
is disabled on my workstation.
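
For anyone wanting to test point 2 on their own pool, here is a minimal sketch. It assumes Julia 1.x with the Sockets standard library, which is newer than the Julia used in this thread. The function name `can_reach`, the host name and the timeout are hypothetical and are not part of ClusterManagers or condor.jl. Run from a worker node, it only reports whether a plain TCP connection to the master's port can be opened within a few seconds.

```julia
using Sockets

# Return true if a TCP connection to host:port can be opened within
# `timeout` seconds, false if it is refused, fails or takes too long.
function can_reach(host::AbstractString, port::Integer; timeout::Real = 5.0)
    result = Channel{Bool}(1)
    @async begin
        try
            sock = connect(host, port)   # throws if the connection is refused
            close(sock)
            put!(result, true)
        catch
            put!(result, false)
        end
    end
    # A firewalled port may never answer, so stop waiting after `timeout`.
    status = timedwait(() -> isready(result), Float64(timeout))
    return status === :ok && take!(result)
end
```

A call such as `can_reach("master.example.org", 8553)` (placeholder host name) returning `false` would suggest a firewall or a disabled service between worker and master, rather than a Julia problem.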

Now I have to go, but later on I will try to configure telnet properly
and see if I can get it running.

Cheers,
--
Ángel de Vicente
http://www.iac.es/galeria/angelv/         

Re: Running julia with HTCondor

Angel de Vicente
Hi again,

Angel de Vicente <[hidden email]> writes:

> I'm also trying to run Julia with Condor, and so far I have hit the same
> issue as you. I'm not sure whether this applies to you, but in my case
> the problem seems twofold:
>
> 1- I have julia installed in a directory that is not accessible to the
> workers (for the moment, for testing, I just modified condor.jl by hand
> so that the workers can get to those files, and this is no longer a
> problem).
>
> 2- workers try to connect to the master via telnet (at port 8553), which
> is disabled on my workstation.

Well, looking at the code in condor.jl, it is actually a random port
from 8000 to 9000. In the end, I didn't like the idea of installing
telnet just to be able to run Julia+Condor, so I tried to use ssh
instead, but I'm not getting very far. The workers get submitted, and
they try to run julia --worker, but then I get this message about
julia_worker:9009 in the Condor error files, and I'm not sure where it
comes from.

,----
| Pseudo-terminal will not be allocated because stdin is not a terminal.^M
| julia_worker:9009: Command not found.
| Master process (id 1) could not connect within 60.0 seconds.
| exiting.
`----
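
To make the "random port from 8000 to 9000" idea concrete, here is a minimal sketch of how a master process could pick such a port and start listening on it. This is not the code from condor.jl. It assumes Julia 1.x with the Sockets standard library, and the function name `listen_on_random_port` and its keyword arguments are hypothetical.

```julia
using Sockets

# Try random ports in `range` until one can be bound.
# Returns the chosen port and the listening server.
# The default host IPv4(0) listens on all interfaces, so that other
# machines (the workers) can connect, firewall permitting.
function listen_on_random_port(range::UnitRange{Int} = 8000:9000;
                               host::IPAddr = IPv4(0), attempts::Int = 20)
    for _ in 1:attempts
        port = rand(range)
        try
            server = listen(host, port)
            return port, server
        catch
            # Port busy or not permitted: fall through and try another one.
        end
    end
    error("no free port found in $range after $attempts attempts")
end
```

Because the port changes on every run, any firewall rule between the workers and the master would have to allow the whole range, not just one port seen in a single error message.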

In any case, I'm not sure how robust the Julia-Condor connection is. It
seems (correct me if I'm wrong, as I haven't been able to use it yet)
that it is based on the idea that Condor is like other workload
managers, so I would request a number of workers and then use them for a
parallel computation, assuming that they are going to be there all the
time. But the beauty of Condor is mainly that it is an opportunistic
scheduler: if I have 10000 tasks, Condor will start executing them on
whatever resources are available, perhaps only 10 workers now and
perhaps 200 workers later, and if a worker becomes unavailable in the
middle of a task, then the task is automatically rescheduled to
another worker, where it will start all over again (unless checkpointing
is enabled).
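
The opportunistic model described above can be illustrated with a toy simulation. This is only a sketch in plain Julia: it does not talk to Condor at all, and `simulate_pool`, the availability pattern and the eviction counts are made up for illustration. Each round a varying number of slots is available, every task takes one round, and evicted tasks go back to the queue and start again from scratch.

```julia
# slots[i]     : number of workers available in round i (pattern repeats)
# evictions[i] : number of tasks started in round i that lose their worker
function simulate_pool(ntasks::Int, slots::Vector{Int}, evictions::Vector{Int})
    length(slots) == length(evictions) ||
        throw(ArgumentError("slots and evictions must have the same length"))
    any(slots .> evictions) ||
        throw(ArgumentError("no round would ever complete a task"))

    queue = collect(1:ntasks)        # task ids waiting for a worker
    attempts = zeros(Int, ntasks)    # how many times each task was started
    done = Int[]                     # task ids in order of completion
    nrounds = 0
    while !isempty(queue)
        nrounds += 1
        i = mod1(nrounds, length(slots))          # cycle through the pattern
        nstart = min(slots[i], length(queue))
        running = splice!(queue, 1:nstart)        # tasks that get a slot now
        attempts[running] .+= 1
        nevict = min(evictions[i], nstart)
        append!(queue, running[1:nevict])         # evicted: restart later
        append!(done, running[nevict+1:end])      # the rest finish this round
    end
    return (rounds = nrounds, attempts = attempts, done = done)
end

result = simulate_pool(10, [2, 5], [1, 0])
println("rounds: ", result.rounds, ", total starts: ", sum(result.attempts))
```

With 10 tasks, slots alternating between 2 and 5, and one eviction in every 2-slot round, all tasks finish in 4 rounds, and two of them are started twice. A fixed set of workers requested up front, as with other workload managers, has no equivalent of the requeue step.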

If somebody could shed some light on how to get Julia+HTCondor working
properly, that would be great, as we have a large Condor pool at work
and it would be very interesting to try it.

Cheers,
--
Ángel de Vicente
http://www.iac.es/galeria/angelv/         