|
|
The High Performance Scripting team at Intel Labs is pleased to announce the release of version 0.2 of ParallelAccelerator.jl, a package for high-performance parallel computing in Julia, primarily oriented around arrays and stencils. In this release, we provide support for Julia 0.5 and introduce experimental support for the Julia native threading backend. While we still support Julia 0.4, that support should be considered deprecated, and we recommend everyone move to Julia 0.5, as Julia 0.4 support may be removed in the future.

The goal of ParallelAccelerator is to accelerate the computational kernel of an application: the programmer simply annotates the kernel function with the @acc (short for "accelerate") macro provided by the ParallelAccelerator package. In version 0.2, ParallelAccelerator still defaults to translating the kernel to OpenMP C code, compiling it with a system C compiler (ICC or GCC), and transparently invoking the compiled C code from Julia as if the program were running normally.

However, ParallelAccelerator v0.2 also introduces experimental backend support for Julia's native threading (which is itself experimental). To enable native threading mode, set the environment variable PROSPECT_MODE=threads. In this mode, ParallelAccelerator identifies pieces of code that can be run in parallel, runs that code as if it had been annotated with Julia's @threads, and goes through the standard Julia compiler pipeline with LLVM. The ParallelAccelerator C backend has the limitation that the kernel functions, and anything they call, cannot include code that is not type-stable to a single type; in particular, variables of type Any are not supported. In practice, this restriction was a significant limitation. The native threading backend needs no such restriction and should therefore handle arbitrary Julia code.

Under the hood, ParallelAccelerator is essentially a domain-specific compiler written in Julia. It performs additional analysis and optimization on top of the Julia compiler. ParallelAccelerator discovers and exploits the implicit parallelism in source programs that use parallel programming patterns such as map, reduce, comprehension, and stencil. For example, Julia array operators such as .+, .-, .*, and ./ are translated internally by ParallelAccelerator into data-parallel map operations over all elements of the input arrays. For the most part, these patterns are already present in standard Julia, so programmers can use ParallelAccelerator to run the same Julia program without (significantly) modifying the source code.

Version 0.2 should be considered an alpha release, suitable for early adopters and Julia enthusiasts. Please file bugs at https://github.com/IntelLabs/ParallelAccelerator.jl/issues . See our GitHub repository at https://github.com/IntelLabs/ParallelAccelerator.jl for a complete list of prerequisites, supported platforms, example programs, and documentation.

Thanks to our colleagues at Intel and Intel Labs, the Julia team, and the broader Julia community for their support of our efforts!

Best regards,
The High Performance Scripting team (Parallel Computing Lab, Intel Labs)
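To make the annotation workflow concrete, here is a minimal sketch (my own illustration, not from the announcement; the function name and data are made up) of how a kernel written with ordinary array operators would be accelerated:

```julia
using ParallelAccelerator

# @acc asks ParallelAccelerator to compile this function. The element-wise
# operators .* and .+ are recognized as implicitly parallel map patterns,
# so the whole expression can become one fused data-parallel loop.
@acc function axpy(a, x, y)
    a .* x .+ y
end

x = rand(10^6)
y = rand(10^6)
z = axpy(2.0, x, y)   # first call triggers compilation; later calls reuse it
```

With the default backend this goes through OpenMP C code and a system C compiler; with PROSPECT_MODE=threads set in the environment, the same annotated function runs through Julia's native threading instead.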
|
|
Actually, it seems that the pull request into Julia METADATA to set the default ParallelAccelerator version has not yet been merged. So, if you want the simplest package update, or to get the correct version with a simple Pkg.add, hold off until the merge happens. I'll post here again when it does.
thanks,
Todd
|
|
Okay, METADATA with ParallelAccelerator version 0.2 has been merged, so if you do a standard Pkg.add() or Pkg.update() you should get the latest version.
For native threads, please note that we've identified some issues with reductions and stencils. Those have been fixed, and the fixes will be released shortly in version 0.2.1. I will post here again when that release takes place.
Again, please give it a try and report back with experiences or file bugs.
thanks!
Todd
|
|
This is great stuff. Initial observations (under Linux/GCC) are that native threads are about 20% faster than OpenMP, so I surmise you are feeding LLVM some very tasty code. (I tested long loops with straightforward memory access.)
On the other hand, some of the earlier posts make me think that you were leveraging the strong vector optimization of the Intel C compiler and its tight coupling to MKL libraries. If so, is there any prospect of getting LLVM to take advantage of MKL?
On Wednesday, October 26, 2016 at 8:13:38 PM UTC-4, Todd Anderson wrote:
|
|
With appreciation for Intel Labs' commitment, our thanks to the people who landed v0.2 of the ParallelAccelerator project.

On Wednesday, October 26, 2016 at 8:13:38 PM UTC-4, Todd Anderson wrote:
|
|
That's interesting. I generally don't test with GCC, and my experiments with ICC have shown LLVM/native threads to be something like 20% slower for one class of benchmarks (like blackscholes) but 2-4x slower for some other benchmarks (like laplace-3d). The 20% may be attributable to ICC being better (including at vectorization, as you mention), but certainly not the 2-4x; these larger differences are still under investigation.

I guess something we've said in the docs or our postings has created the impression that our performance gains are somehow related to MKL, or BLAS in general. If you have MKL, you can compile Julia to use it through its LLVM path. ParallelAccelerator does not insert calls to MKL where they didn't exist in the incoming IR, and I don't think ICC does either. If MKL calls do exist in the incoming IR, we don't modify them.

On Wednesday, October 26, 2016 at 7:51:33 PM UTC-7, Ralph Smith wrote:
|
|
Thank you for all of your amazing work. I will be giving v0.2 a try soon. But I have two questions:
1) How do you see ParallelAccelerator integrating with packages? I asked this in the chatroom, but I think having it here might be helpful for others to chime in. If I want to use ParallelAccelerator in a package, then it seems like I would have to require it (and make sure that every user I have can compile it!) and sprinkle the macros around. Is there some sensible way to be able to use ParallelAccelerator if it's available on the user's machine, but not otherwise? This might be something that requires Pkg3, but even with Pkg3 I don't see how to do this without having one version of the function with a macro, and another without it.
2) What do you see as the future of ParallelAccelerator going forward? It seems like Base Julia is stepping all over your domain: automated loop fusion, multithreading, etc. What exactly does ParallelAccelerator give that Base Julia does not, or in the near future will not or cannot? I am curious because, with Base Julia getting so many optimizations itself, it's hard to tell whether supporting ParallelAccelerator will be a worthwhile investment in a year or two, and I wanted to know what you think of that. I don't mean you haven't done great work: you clearly have, but it seems Julia is also doing a lot of great work!

On Tuesday, October 25, 2016 at 9:42:44 AM UTC-7, Todd Anderson wrote:
|
|
Not speaking on behalf of the ParallelAccelerator team, but the long-term future of ParallelAccelerator, in my opinion, is to do exactly that: keep pushing on new things and get them (code or ideas) merged into Base as they stabilize. Without the ParallelAccelerator team pushing us, multi-threading would have moved much slower.
-viral

On Thursday, October 27, 2016 at 11:17:57 PM UTC+5:30, Chris Rackauckas wrote:
|
|
To answer your question #1, would the following be suitable? There may be a couple of details to work out, but what about the general approach?

if haskey(Pkg.installed(), "ParallelAccelerator")
    println("ParallelAccelerator present")
    using ParallelAccelerator
    macro PkgCheck(ast)
        quote
            @acc $(esc(ast))
        end
    end
else
    println("ParallelAccelerator not present")
    macro PkgCheck(ast)
        return ast
    end
end

@PkgCheck function f1(x)
    x * 5
end

a = f1(10)
println("a = ", a)
2) The point of ParallelAccelerator is to extract implicit parallelism automatically, whereas the purpose of @threads is to let you express parallelism explicitly. So they both enable parallelism, but the former has the potential to be a lot easier to use, particularly for scientific programmers who are more scientist than programmer. In general, I feel there is room for all of these approaches to be supported, across a range of programming ability.

On Thursday, October 27, 2016 at 10:47:57 AM UTC-7, Chris Rackauckas wrote:
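For contrast, here is a rough sketch (my own illustration, not from the package docs) of the explicit style that Base's @threads requires for the kind of reduction an @acc-annotated one-liner would discover automatically:

```julia
using Base.Threads

# Explicit parallelism: the programmer partitions the reduction by hand
# and combines per-thread partial sums afterwards.
function sum_sq_threads(x)
    partials = zeros(nthreads())
    @threads for i in 1:length(x)
        partials[threadid()] += x[i]^2
    end
    return sum(partials)
end

# Implicit parallelism: with ParallelAccelerator, the equivalent is ordinary
# array code, and the map+reduce pattern is found automatically:
#     @acc sum_sq_acc(x) = sum(x .* x)
```

The explicit version makes the author think about thread counts and partial results; the implicit one leaves the original array code untouched.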
|
|
Looks like version 0.2.1 has been merged now.

On Wednesday, October 26, 2016 at 5:13:38 PM UTC-7, Todd Anderson wrote:
|
|
1) Won't that have bad interactions with precompilation? Since macros apply at parse time, the package will stay in the "state" that it precompiles in: if one precompiles the package and then adds ParallelAccelerator, wouldn't ParallelAccelerator not be used? And the other way around: if one removes ParallelAccelerator, won't the package be unusable without manually deleting the precompile cache? I think that to use this you'd have to link the precompilation of the package to whether ParallelAccelerator's installed state has changed.
2) Shouldn't/Won't Base auto-parallelize broadcasted calls? That seems like the clear next step after loop fusing is finished and threading is no longer experimental. Where else is the implicit parallelism hiding? On Thursday, October 27, 2016 at 2:02:38 PM UTC-7, Todd Anderson wrote: To answer your question #1, would the following be suitable? There may be a couple details to work out but what about the general approach? if haskey(Pkg.installed(), "ParallelAccelerator") println("ParallelAccelerator present")
using ParallelAccelerator
macro PkgCheck(ast) quote @acc $(esc(ast)) end end else println("ParallelAccelerator not present")
macro PkgCheck(ast) return ast end end
@PkgCheck function f1(x) x * 5 end
a = f1(10) println("a = ", a)
2) The point of ParallelAccelerator is to extract the implicit parallelism automatically. The purpose of @threads is to allow you to express parallelism explicitly. So, they both enable parallelism but the former has the potential to be a lot easier to use particularly for scientific programmers who are more scientist than programmer. In general, I feel there is room for all approaches to be supported across a range of programming ability. On Thursday, October 27, 2016 at 10:47:57 AM UTC-7, Chris Rackauckas wrote: Thank you for all of your amazing work. I will be giving v0.2 a try soon. But I have two questions:
1) How do you see ParallelAccelerator integrating with packages? I asked this in the chatroom, but I think having it here might be helpful for others to chime in. If I want to use ParallelAccelerator in a package, then it seems like I would have to require it (and make sure that every user I have can compile it!) and sprinkle the macros around. Is there some sensible way to be able to use ParallelAccelerator if it's available on the user's machine, but not otherwise? This might be something that requires Pkg3, but even with Pkg3 I don't see how to do this without having one version of the function with a macro, and another without it.
2) What do you see as the future of ParallelAccelerator going forward? It seems like Base Julia is stepping all over your domain: automated loop fusing, multithreading, etc. What exactly does ParallelAccelerator give that Base Julia does not or, in the near future, will not / can not? I am curious because with Base Julia getting so many optimizations itself, it's hard to tell whether supporting ParallelAccelerator will be a worthwhile investment in a year or two, and wanted to know what you guys think of that. I don't mean you haven't done great work: you clearly have, but it seems Julia is also doing a lot of great work! On Tuesday, October 25, 2016 at 9:42:44 AM UTC-7, Todd Anderson wrote: The High Performance Scripting team at Intel Labs is pleased to announce
the release of version 0.2 of ParallelAccelerator.jl, a package for
high-performance parallel computing in Julia, primarily oriented around arrays and stencils. In this release, we provide support for Julia 0.5 and introduce experimental support for the Julia native threading backend. While we still currently support Julia 0.4, such support should be considered deprecated and we recommend everyone move to Julia 0.5 as Julia 0.4 support may be removed in the future.
The goal of ParallelAccelerator
is to accelerate the computational kernel of an application by the programmer simply annotating the kernel function with the @acc (short for "accelerate") macro, provided by the ParallelAccelerator package. In version 0.2, ParallelAccelerator still defaults to transforming the kernel to OpenMP C code that is then compiled with a system C compiler (ICC or GCC) and transparently handles the invocation of the C code from Julia as if the program were running normally.
However, ParallelAccelerator v0.2 also introduces experimental backend support for Julia's native threading (which is also experimental). To enable native threading mode, set the environment variable PROSPECT_MODE=threads. In this mode, ParallelAccelerator identifies pieces of code that can be run in parallel and then runs that code as if it had been annotated with Julia's @threads and goes through the standard Julia compiler pipeline with LLVM. The ParallelAccelerator C backend has the limitation that the kernel functions and anything called by those cannot include code that is not type-stable to a single type. In particular, variables of type Any are not supported. In practice, this restriction was a significant limitation. For the native threading backend, no such restriction is necessary and thus our backend should handle arbitrary Julia code.
Under the
hood, ParallelAccelerator is essentially a domain-specific compiler
written in Julia. It performs additional analysis and optimization on
top of the Julia compiler. ParallelAccelerator discovers and exploits
the implicit parallelism in source programs that use parallel
programming patterns such as map, reduce, comprehension, and stencil.
For example, Julia array operators such as .+, .-, .*, ./ are translated
by ParallelAccelerator internally into data-parallel map operations
over all elements of input arrays. For the most part, these patterns are
already present in standard Julia, so programmers can use
ParallelAccelerator to run the same Julia program without
(significantly) modifying the source code.
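To make the patterns concrete, here is a sketch (the kernel below is hypothetical, but it uses only standard Julia array operations) of code whose elementwise operators become a data-parallel map and whose sum becomes a parallel reduction under @acc:

```julia
using ParallelAccelerator

# Plain Julia array code: under @acc, the elementwise .* and .+ are
# recognized as a data-parallel map over all elements of the input
# arrays, and sum(...) is recognized as a parallel reduction.
@acc function weighted_norm(a, b)
    c = a .* a .+ b .* b   # map pattern
    sum(c)                 # reduce pattern
end
```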
Version 0.2 should be considered an alpha release, suitable for early adopters and Julia enthusiasts. Please file bugs at https://travis-ci.org/IntelLabs/ParallelAccelerator.jl/issues .
See our GitHub repository at https://github.com/IntelLabs/ParallelAccelerator.jl for a complete list of prerequisites, supported platforms, example programs, and documentation.
Thanks
to our colleagues at Intel and Intel Labs, the Julia team, and the
broader Julia community for their support of our efforts!
Best regards, The High Performance Scripting team (Parallel Computing Lab, Intel Labs)
I looked a bit deeper (i.e. found a machine where I have access to an Intel compiler, albeit not up to date - my shop is cursed by budget cuts). ICC breaks up a loop like for (i=0; i<n; i++) { a[i] = exp(cos(b[i])); s += a[i]; }
into calls to vector math library functions and a separate loop for the sum. The library is bundled with ICC; it's not MKL, but its domain overlaps with MKL - hence my misapprehension - so your point stands. Something like blackscholes benefits from these vector library calls, and GCC doesn't do that.
It would be nice if Julia's LLVM system included an optimization pass which invoked a vector math library when appropriate. I guess that's a challenge outside the scope of ParallelAccelerator, but maybe good ground for some other project.

On Thursday, October 27, 2016 at 1:04:33 PM UTC-4, Todd Anderson wrote:

That's interesting. I generally don't test with GCC, and my experiments with ICC/C have shown LLVM/native threads to be something like 20% slower for one class of benchmarks (like blackscholes) but 2-4x slower for others (like laplace-3d). The 20% may be attributable to ICC being better (including at vectorization, as you mention), but certainly not the 2-4x. These larger differences are still under investigation.

I guess something we have said in the docs or our postings has created the impression that our performance gains are somehow related to MKL or BLAS in general. If you have MKL, then you can compile Julia to use it through its LLVM path. ParallelAccelerator does not insert calls to MKL where they didn't exist in the incoming IR, and I don't think ICC does either. If MKL calls exist in the incoming IR, we don't modify them either.

On Wednesday, October 26, 2016 at 7:51:33 PM UTC-7, Ralph Smith wrote:

This is great stuff. Initial observations (under Linux/GCC) are that native threads are about 20% faster than OpenMP, so I surmise you are feeding LLVM some very tasty code. (I tested long loops with straightforward memory access.)
On the other hand, some of the earlier posts make me think that you were leveraging the strong vector optimization of the Intel C compiler and its tight coupling to MKL libraries. If so, is there any prospect of getting LLVM to take advantage of MKL?
On Wednesday, October 26, 2016 at 8:13:38 PM UTC-4, Todd Anderson wrote:

Okay, METADATA with ParallelAccelerator version 0.2 has been merged, so if you do a standard Pkg.add() or update() you should get the latest version.
For native threads, please note that we've identified some issues with reductions and stencils; these have been fixed, and the fixes will be released shortly in version 0.2.1. I will post here again when that release takes place.
Again, please give it a try and report back with experiences or file bugs.
thanks!
Todd