Odeint and GPU
2022-07-01 04:22:00 【Feisy】
1 Original link
https://www.boost.org/doc/libs/1_55_0/libs/numeric/odeint/doc/html/boost_numeric_odeint/tutorial/using_cuda__or_openmp__tbb_______via_thrust.html
2 Introduction (do not use the GPU for small problems)
Modern graphic cards (graphic processing units - GPUs) can be used to speed up the performance of time consuming algorithms by means of massive parallelization. They are designed to execute many operations in parallel. odeint can utilize the power of GPUs by means of CUDA and Thrust, which is a STL-like interface for the native CUDA API.
Thrust also supports parallelization using OpenMP and Intel Threading Building Blocks (TBB).
You can switch between CUDA, OpenMP and TBB parallelizations by a simple compiler switch.
Hence, this also provides an easy way to get basic OpenMP parallelization into odeint.
The examples discussed below are focused on GPU parallelization, though.
To use odeint with CUDA, a few points have to be taken into account. First of all, the problem has to be well chosen. It makes no sense to parallelize the code for a three-dimensional system; it is simply too small to be worth the effort. A single function call (kernel execution) on the GPU is slow, but one call can operate on a huge set of data. In our experience, the vector size over which the computation is parallelized should be of the order of 10^6 to make full use of the GPU. Secondly, you have to use Thrust's algorithms and functors when implementing the rhs of the ODE. This can be tricky since it requires some familiarity with functional programming.
Typical applications for CUDA and odeint are large systems, like lattices or discretizations of PDEs, and parameter studies. We now introduce three examples which show how the power of GPUs can be used in combination with odeint.
The full power of CUDA is only available for really large systems where the number of coupled ordinary differential equations is of order N = 10^6 or larger.
For smaller systems the CPU is usually much faster.
You can also integrate an ensemble of different uncoupled ODEs in parallel as shown in the last example.
3 Phase oscillator ensemble (rough outline of how to use the GPU)
The first example is the phase oscillator ensemble from the previous section:
$$\frac{d\varphi_k}{dt} = \omega_k + \frac{\epsilon}{N} \sum_{j=1}^{N} \sin(\varphi_j - \varphi_k)$$
It has a phase transition at ε = 2 in the limit of an infinite number of oscillators N. For finite N this transition is smeared out but still clearly visible.
Thrust and CUDA are perfectly suited for such kinds of problems where one needs a large number of particles (oscillators). We start by defining the state type which is a thrust::device_vector. The content of this vector lives on the GPU. If you are not familiar with this we recommend reading the Getting started section on the Thrust website.
//change this to float if your device does not support double computation
typedef double value_type;
//change this to host_vector< ... > if you want to run on CPU
typedef thrust::device_vector< value_type > state_type;
// typedef thrust::host_vector< value_type > state_type;
Thrust follows a functional programming approach. If you want to perform a calculation on the GPU you usually have to call a global function like thrust::for_each, thrust::reduce, … with an appropriate local functor which performs the basic operation. An example is
struct add_two
{
    template< class T >
    __host__ __device__
    void operator()( T &t ) const
    {
        t += T( 2 );
    }
};
// ...
thrust::for_each( x.begin() , x.end() , add_two() );
This code generically adds two to every element in the container x.
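Since Thrust mirrors the STL, the same pattern can be tried out on the host without a CUDA toolchain. The following is a minimal sketch under that assumption: the `__host__ __device__` qualifiers are dropped so a plain C++ compiler accepts it, `std::for_each` stands in for `thrust::for_each`, and `add_two_to_all` is a hypothetical helper that is not part of Thrust or odeint.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Same functor as in the text, minus the CUDA-only qualifiers.
struct add_two
{
    template< class T >
    void operator()( T &t ) const
    {
        t += T( 2 );
    }
};

// Hypothetical helper: applies add_two to every element of a host
// container using the standard-library algorithm and returns the result.
inline std::vector< double > add_two_to_all( std::vector< double > x )
{
    std::for_each( x.begin() , x.end() , add_two() );
    return x;
}
```

Calling `add_two_to_all` on a vector holding 1.0 and 2.5 yields 3.0 and 4.5. The functor itself is unchanged, which is the point of the functional style: only the algorithm that drives it decides where the work runs.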
For the purpose of integrating the phase oscillator ensemble we need
- to calculate the system function, i.e. the r.h.s. of the ODE;
- to compute the mean field of the oscillator ensemble, i.e. the values of R and θ (called K and Theta in the code), which the system function requires.
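Why the mean field is enough to evaluate the right-hand side can be made explicit with a short derivation. It assumes the standard all-to-all coupled form of the model and writes K and Θ for the amplitude and phase of the mean field (the R and θ mentioned above, and the `K` and `Theta` of the code):

```latex
\begin{aligned}
K e^{i\Theta} &= \frac{1}{N} \sum_{j=1}^{N} e^{i\varphi_j} \\
K \sin(\Theta - \varphi_k) &= \frac{1}{N} \sum_{j=1}^{N} \sin(\varphi_j - \varphi_k) \\
\frac{d\varphi_k}{dt} &= \omega_k + \frac{\epsilon}{N} \sum_{j=1}^{N} \sin(\varphi_j - \varphi_k)
                       = \omega_k + \epsilon K \sin(\Theta - \varphi_k)
\end{aligned}
```

The second line follows from multiplying the first by e^{-iφ_k} and taking the imaginary part. Once K and Θ are known, each oscillator's derivative is a constant amount of work, so the whole right-hand side costs O(N) instead of O(N^2).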
The mean field is calculated in a class mean_field_calculator
struct mean_field_calculator
{
    // Unary functors applied element-wise through a transform iterator.
    struct sin_functor : public thrust::unary_function< value_type , value_type >
    {
        __host__ __device__
        value_type operator()( value_type x) const
        {
            return sin( x );
        }
    };
    struct cos_functor : public thrust::unary_function< value_type , value_type >
    {
        __host__ __device__
        value_type operator()( value_type x) const
        {
            return cos( x );
        }
    };
    // Returns the pair ( K , Theta ): amplitude and phase of the mean field.
    static std::pair< value_type , value_type > get_mean( const state_type &x )
    {
        // Sum of sin( x_k ) over all oscillators.
        value_type sin_sum = thrust::reduce(
            thrust::make_transform_iterator( x.begin() , sin_functor() ) ,
            thrust::make_transform_iterator( x.end() , sin_functor() ) );
        // Sum of cos( x_k ) over all oscillators.
        value_type cos_sum = thrust::reduce(
            thrust::make_transform_iterator( x.begin() , cos_functor() ) ,
            thrust::make_transform_iterator( x.end() , cos_functor() ) );
        // Normalize the sums to averages.
        cos_sum /= value_type( x.size() );
        sin_sum /= value_type( x.size() );
        value_type K = sqrt( cos_sum * cos_sum + sin_sum * sin_sum );
        value_type Theta = atan2( sin_sum , cos_sum );
        return std::make_pair( K , Theta );
    }
};
Inside this class two member structures, sin_functor and cos_functor, are defined. They compute the sine and the cosine of a value and are used within a transform iterator to calculate the sums of sin(φ_k) and cos(φ_k). The qualifiers __host__ and __device__ are CUDA-specific and mark a function or operator that can be executed on the GPU as well as on the CPU. The line
value_type sin_sum = thrust::reduce(
    thrust::make_transform_iterator( x.begin() , sin_functor() ) ,
    thrust::make_transform_iterator( x.end() , sin_functor() ) );
performs the calculation of this sine-sum on the GPU (or on the CPU, depending on your thrust configuration).
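For checking results it can be handy to have a host-only reference of the same reduction. The sketch below assumes a plain `std::vector< double >` instead of `thrust::device_vector` and uses explicit loops in place of `thrust::reduce` over transform iterators. `get_mean_host` is a hypothetical name, not part of the original example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Host-only reference: returns the pair ( K , Theta ) of the mean field,
// i.e. amplitude and phase of the average of exp( i * x_k ).
inline std::pair< double , double > get_mean_host( const std::vector< double > &x )
{
    assert( !x.empty() );   // the mean over zero oscillators is undefined
    double sin_sum = 0.0 , cos_sum = 0.0;
    for( std::size_t k = 0 ; k < x.size() ; ++k )
    {
        sin_sum += std::sin( x[k] );
        cos_sum += std::cos( x[k] );
    }
    sin_sum /= double( x.size() );
    cos_sum /= double( x.size() );
    const double K = std::sqrt( cos_sum * cos_sum + sin_sum * sin_sum );
    const double Theta = std::atan2( sin_sum , cos_sum );
    return std::make_pair( K , Theta );
}
```

Two limiting cases make useful sanity checks: if all phases are equal, K is 1 and Theta equals the common phase; if two oscillators sit at phases 0 and π, K is 0 up to rounding.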
The system function is defined via
class phase_oscillator_ensemble
{
public:
    // Computes the derivative of a single oscillator from the mean field.
    struct sys_functor
    {
        value_type m_K , m_Theta , m_epsilon;
        sys_functor( value_type K , value_type Theta , value_type epsilon )
        : m_K( K ) , m_Theta( Theta ) , m_epsilon( epsilon ) { }
        // Tuple layout: <0> phase x_k , <1> frequency omega_k , <2> output dxdt_k.
        template< class Tuple >
        __host__ __device__
        void operator()( Tuple t )
        {
            thrust::get<2>(t) = thrust::get<1>(t) + m_epsilon * m_K * sin( m_Theta - thrust::get<0>(t) );
        }
    };
    // ...
    // System function called by odeint; the third argument is the current
    // time, which this system does not use.
    void operator() ( const state_type &x , state_type &dxdt , const value_type dt ) const
    {
        std::pair< value_type , value_type > mean_field = mean_field_calculator::get_mean( x );
        // Walk x , m_omega and dxdt in lockstep and apply sys_functor to each triple.
        thrust::for_each(
            thrust::make_zip_iterator( thrust::make_tuple( x.begin() , m_omega.begin() , dxdt.begin() ) ),
            thrust::make_zip_iterator( thrust::make_tuple( x.end() , m_omega.end() , dxdt.end()) ) ,
            sys_functor( mean_field.first , mean_field.second , m_epsilon )
        );
    }
    // ...
};
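The system function above can be cross-checked on the host against a direct evaluation of the pairwise coupling sum. The sketch below is a CPU reference under stated assumptions, not the original example: it uses `std::vector< double >`, assumes the all-to-all coupling (ε/N) Σ_j sin(φ_j − φ_k), and the names `rhs_mean_field`, `rhs_pairwise` and `max_rhs_difference` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector< double > host_state;

// O(N) right-hand side via the mean field ( K , Theta ), mirroring sys_functor.
inline void rhs_mean_field( const host_state &x , const host_state &omega ,
                            double epsilon , host_state &dxdt )
{
    assert( !x.empty() && x.size() == omega.size() && x.size() == dxdt.size() );
    double s = 0.0 , c = 0.0;
    for( std::size_t j = 0 ; j < x.size() ; ++j )
    {
        s += std::sin( x[j] );
        c += std::cos( x[j] );
    }
    s /= double( x.size() );
    c /= double( x.size() );
    const double K = std::sqrt( c * c + s * s );
    const double Theta = std::atan2( s , c );
    for( std::size_t k = 0 ; k < x.size() ; ++k )
        dxdt[k] = omega[k] + epsilon * K * std::sin( Theta - x[k] );
}

// O(N^2) right-hand side that evaluates the pairwise coupling sum directly.
inline void rhs_pairwise( const host_state &x , const host_state &omega ,
                          double epsilon , host_state &dxdt )
{
    assert( !x.empty() && x.size() == omega.size() && x.size() == dxdt.size() );
    for( std::size_t k = 0 ; k < x.size() ; ++k )
    {
        double sum = 0.0;
        for( std::size_t j = 0 ; j < x.size() ; ++j )
            sum += std::sin( x[j] - x[k] );
        dxdt[k] = omega[k] + epsilon * sum / double( x.size() );
    }
}

// Largest absolute difference between the two evaluations.
inline double max_rhs_difference( const host_state &x , const host_state &omega , double epsilon )
{
    host_state a( x.size() ) , b( x.size() );
    rhs_mean_field( x , omega , epsilon , a );
    rhs_pairwise( x , omega , epsilon , b );
    double d = 0.0;
    for( std::size_t k = 0 ; k < x.size() ; ++k )
        d = std::max( d , std::fabs( a[k] - b[k] ) );
    return d;
}
```

Both functions should agree to rounding error for any set of phases. The mean-field version needs one reduction plus one element-wise pass, which is the structure that maps well onto `thrust::reduce` and `thrust::for_each`.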
Now, we are ready to put everything together. All we have to do for making odeint ready for using the GPU is to parametrize the stepper with the appropriate thrust algebra/operations:
typedef runge_kutta4< state_type , value_type , state_type , value_type , thrust_algebra , thrust_operations > stepper_type;
You can also use a controlled or dense output stepper, e.g.
typedef runge_kutta_dopri5< state_type , value_type , state_type , value_type , thrust_algebra , thrust_operations > stepper_type;
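To see why swapping the algebra and operations is enough, it may help to look at what a Runge-Kutta 4 step does with the state: apart from calling the system function, it only forms element-wise linear combinations of vectors. The sketch below is a hand-written host version, not odeint's implementation; `axpy`, `rk4_step`, `decay` and `integrate_decay` are hypothetical names. Roughly speaking, the element-wise loops shown here are the part that thrust_algebra and thrust_operations execute on the device.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector< double > host_state;

// out = a + h * b, element by element.
inline void axpy( const host_state &a , double h , const host_state &b , host_state &out )
{
    assert( a.size() == b.size() );
    out.resize( a.size() );
    for( std::size_t i = 0 ; i < a.size() ; ++i )
        out[i] = a[i] + h * b[i];
}

// One classical Runge-Kutta 4 step for dx/dt = sys( x , t ); x is updated in place.
template< class System >
void rk4_step( System sys , host_state &x , double t , double dt )
{
    host_state k1( x.size() ) , k2( x.size() ) , k3( x.size() ) , k4( x.size() ) , tmp;
    sys( x , k1 , t );
    axpy( x , 0.5 * dt , k1 , tmp );
    sys( tmp , k2 , t + 0.5 * dt );
    axpy( x , 0.5 * dt , k2 , tmp );
    sys( tmp , k3 , t + 0.5 * dt );
    axpy( x , dt , k3 , tmp );
    sys( tmp , k4 , t + dt );
    for( std::size_t i = 0 ; i < x.size() ; ++i )
        x[i] += dt / 6.0 * ( k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i] );
}

// Test system dx/dt = -x with the same call signature as the rhs class.
struct decay
{
    void operator()( const host_state &x , host_state &dxdt , double /* t */ ) const
    {
        for( std::size_t i = 0 ; i < x.size() ; ++i )
            dxdt[i] = -x[i];
    }
};

// Integrates dx/dt = -x from t = 0 to t = 1 in n equal steps.
inline double integrate_decay( double x0 , int n )
{
    host_state x( 1 , x0 );
    const double dt = 1.0 / n;
    for( int i = 0 ; i < n ; ++i )
        rk4_step( decay() , x , i * dt , dt );
    return x[0];
}
```

With 100 steps, `integrate_decay( 1.0 , 100 )` should agree with exp(-1) to well below 1e-9. None of the loops depends on neighbouring elements, which is why they parallelize trivially over large state vectors.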
Then, it is straightforward to integrate the phase ensemble by creating an instance of the rhs class and using an integration function:
phase_oscillator_ensemble ensemble( N , 1.0 );
// stepper_type must provide error estimates here, i.e. the runge_kutta_dopri5 typedef
size_t steps1 = integrate_const(
    make_controlled( 1.0e-6 , 1.0e-6 , stepper_type() ) ,
    boost::ref( ensemble ) , x , 0.0 , t_transients , dt );
We have to use boost::ref here in order to pass the rhs class as reference and not by value. This ensures that the natural frequencies of each oscillator are not copied when calling integrate_const. In the full example the performance and results of the Runge-Kutta-4 and the Dopri5 solver are compared.
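The effect of the reference wrapper can be illustrated without odeint. In the sketch below, `std::ref` stands in for `boost::ref`, `call_by_value` is a hypothetical stand-in for an integrate function that takes the system by value, and `counting_rhs` is a made-up functor that counts how often it is copied. None of these names come from the original example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Functor holding a large vector; every copy increments an external counter.
struct counting_rhs
{
    std::vector< double > omega;   // stands in for the natural frequencies
    int *copies;                   // points to a counter owned by the caller
    counting_rhs( std::size_t n , int *c ) : omega( n , 1.0 ) , copies( c ) { }
    counting_rhs( const counting_rhs &o ) : omega( o.omega ) , copies( o.copies )
    {
        ++( *copies );
    }
    void operator()( const std::vector< double > &x , std::vector< double > &dxdt , double /* t */ ) const
    {
        for( std::size_t i = 0 ; i < x.size() ; ++i )
            dxdt[i] = omega[i];
    }
};

// Hypothetical stand-in for an integrate function: takes the system by value.
template< class System >
void call_by_value( System sys , const std::vector< double > &x , std::vector< double > &dxdt )
{
    sys( x , dxdt , 0.0 );
}

// Number of functor copies made when the functor is passed directly.
inline int copies_when_passed_directly( std::size_t n )
{
    int copies = 0;
    counting_rhs rhs( n , &copies );
    std::vector< double > x( n , 0.0 ) , dxdt( n , 0.0 );
    call_by_value( rhs , x , dxdt );
    return copies;
}

// Number of functor copies made when it is wrapped in a reference wrapper.
inline int copies_when_passed_by_ref( std::size_t n )
{
    int copies = 0;
    counting_rhs rhs( n , &copies );
    std::vector< double > x( n , 0.0 ) , dxdt( n , 0.0 );
    call_by_value( std::ref( rhs ) , x , dxdt );
    return copies;
}
```

Passing the functor directly copies it, including its frequency vector, once per call; wrapping it copies only the lightweight wrapper. For a device vector with around 10^6 entries, avoiding that copy is worthwhile.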
The full example can be found at phase_oscillator_example.cu
Further material with code:
https://archive.fosdem.org/2013/schedule/event/odes_cuda_opencl/attachments/slides/181/export/events/attachments/odes_cuda_opencl/slides/181/1430_Karsten.pdf
https://github.com/ddemidov/gpgpu_with_modern_cpp