[W806 drummer's notes] A simple FPU performance test - May 23, 2022

2022-06-09 04:19:00 【ZZZ_ XXJ】

The W806 is a secure IoT MCU chip. It integrates a 32-bit CPU with built-in UART, GPIO, SPI, SDIO, I2C, I2S, PSRAM, 7816, ADC, LCD, and TouchSensor interfaces. It supports a TEE security engine and a variety of hardware encryption and decryption algorithms, and includes a built-in DSP, a floating-point unit, and a security engine. It supports code security permission settings and has 1 MB of built-in flash memory, along with security measures such as encrypted firmware storage, firmware signing, secure debugging, and secure upgrade to protect the product. It is suitable for a wide range of IoT applications, including small household appliances, smart home devices, smart toys, industrial control, and medical care.

A brief introduction to the FPU

The following is excerpted from the XuanTie E804 User's Manual (v04).

The floating-point unit is a configurable hardware unit of the E804, designed to improve the E804's processing power for floating-point applications. It provides a low-cost, high-performance hardware floating-point implementation.
The floating-point unit supports the single-precision operations of the IEEE-754 floating-point standard and implements 16 single-precision floating-point registers. With support from system software, the E804 can also perform double-precision floating-point operations.

The main features of the floating-point unit's architecture and programming model are as follows:

Fully compatible with the ANSI/IEEE Std 754 floating-point standard (with support from system software);
Only single-precision floating-point operations are supported in hardware;
Supports four rounding modes: round toward zero, round toward positive infinity, round toward negative infinity, and round to nearest;
Supports both trapping and non-trapping handling of floating-point exceptions;
Supports precise handling of floating-point exceptions;
Supports hardware floating-point division and square root.
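Since the hardware handles only single precision, the type of each expression matters. In C, a floating constant without the `f` suffix has type `double`, and multiplying a `float` by it yields a `double`. On a single-precision-only FPU that computation would most likely fall back to software routines, although the exact behavior depends on the toolchain, which is an assumption here. The sketch below only demonstrates the C type rule; the function names are hypothetical and not part of any SDK.

```c
#include <stddef.h>

/* Size of the result type when a float is multiplied by an unsuffixed
 * literal. The literal 0.299 is a double, so the product is a double. */
size_t size_with_double_literal(void)
{
    float x = 1.0f;
    return sizeof(x * 0.299);
}

/* Size of the result type when the literal carries the f suffix.
 * Both operands are float, so the product stays a float. */
size_t size_with_float_literal(void)
{
    float x = 1.0f;
    return sizeof(x * 0.299f);
}
```

The same reasoning applies to the math library: `sinf`, `sqrtf`, and `powf` take and return `float`, whereas `sin`, `sqrt`, and `pow` work on `double`.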

The main features of the floating-point unit's microarchitecture are as follows:

16 independent single-precision floating-point registers;
Single-issue structure, issuing one floating-point arithmetic instruction per cycle;
In-order issue, in-order execution, and in-order write-back of floating-point arithmetic instructions;
Three independent execution pipelines: floating-point ALU, floating-point multiplication, and floating-point division;
Optimized execution latency: all instructions except floating-point division and square root complete in 1-2 clock cycles;
Cost optimization based on reuse of arithmetic components;
Power optimization based on clock gating and data path isolation.

Test items

Basic algorithms

Floating-point addition, subtraction, multiplication, and division
Trigonometric and inverse trigonometric functions
Square root
e raised to the power x
x raised to the power y

Compound algorithms

100-point first-order low-pass filter
100-point sine curve evaluation
32×32-pixel RGB to grayscale conversion

Test method

With the FPU disabled and then enabled, record how many times each algorithm executes within a 10 ms timing window. Higher is better.
All tests use single-precision floating point and are compiled with -O3. To minimize loop overhead, the innermost loop is unrolled by hand.
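The original does not show the timing code, and on the W806 the 10 ms window would presumably come from a hardware timer. As a rough host-side sketch of the same counting method, the hypothetical helper below uses the standard `clock()` function instead. It polls the clock only once per 256 unrolled groups so that the clock call itself does not dominate the measurement.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: counts how many times `c = a + b` executes within
 * a window of window_ms milliseconds of CPU time, measured with clock(). */
uint32_t count_additions(uint32_t window_ms)
{
    /* volatile keeps the compiler from removing the repeated operations */
    volatile float a = 1.1f;
    volatile float b = 0.123456f;
    volatile float c;
    uint32_t cnt = 0;

    const clock_t window = (clock_t)((CLOCKS_PER_SEC / 1000) * window_ms);
    const clock_t start = clock();

    while ((clock() - start) < window) {
        for (int i = 0; i < 256; i++) {
            /* innermost work unrolled by hand, four operations per pass */
            c = a + b;
            c = a + b;
            c = a + b;
            c = a + b;
            cnt += 4;
        }
    }
    return cnt;
}
```

Calling `count_additions(10)` returns a count for a 10 ms window on the host machine. The absolute number is not comparable to the W806 figures; only the method is the same.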

Test results

Basic algorithms

Floating-point addition, subtraction, multiplication, and division

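/* __IO (volatile in CMSIS-style headers) keeps the compiler from
   optimizing the repeated operations away */
__IO uint32_t cnt = 0;
__IO float a = 1.1f;
__IO float b = 0.123456f;
__IO float c;

while(1)
{
	/* Unrolled by four to reduce loop overhead */
	c = a + b;
	c = a + b;
	c = a + b;
	c = a + b;
	cnt += 4;
}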

The results are as follows. Subtraction, multiplication, and division are tested in the same way as addition, so their test code is omitted.

Algorithm	FPU off (runs/10 ms)	FPU on (runs/10 ms)	Ratio (on/off)
c = a + b	12708	319268	25.12
c = a - b	11116	319268	28.72
c = a * b	18640	319268	17.13
c = a / b	5592	69412	12.41
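The ratio column is simply the FPU-on count divided by the FPU-off count. The counts can also be turned into an approximate cost per operation, assuming the basic tests ran at the same 240 MHz core clock that the compound tests configure. That assumption is not stated for these tests. A minimal sketch with hypothetical helper names:

```c
#include <stdint.h>

/* Ratio as used in the tables: runs with the FPU on / runs with it off. */
double speedup_ratio(uint32_t runs_fpu_off, uint32_t runs_fpu_on)
{
    return (double)runs_fpu_on / (double)runs_fpu_off;
}

/* Approximate CPU cycles spent per counted operation in a 10 ms window.
 * Assumes a 240 MHz core clock, i.e. 2,400,000 cycles per window. */
double cycles_per_op(uint32_t runs_per_10ms)
{
    const double cycles_per_window = 240e6 * 10e-3;
    return cycles_per_window / (double)runs_per_10ms;
}
```

For addition, `speedup_ratio(12708, 319268)` gives about 25.12. Under the 240 MHz assumption, `cycles_per_op(319268)` is about 7.5 cycles with the FPU and `cycles_per_op(12708)` is about 189 cycles without it. Both figures include the volatile loads and stores around each operation.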

Trigonometric and inverse trigonometric functions

Algorithm	FPU off (runs/10 ms)	FPU on (runs/10 ms)	Ratio (on/off)
c = sinf(a)	440	14068	31.97
c = cosf(a)	480	15628	32.56
c = tanf(a)	236	9268	39.27
c = asinf(a)	6612	41112	6.22
c = acosf(a)	6668	37860	5.68
c = atanf(a)	336	3816	11.36

Square root

while(1)
{
	c = sqrtf(a);
	c = sqrtf(a);
	c = sqrtf(a);
	c = sqrtf(a);
	cnt += 4;
}

Algorithm	FPU off (runs/10 ms)	FPU on (runs/10 ms)	Ratio (on/off)
sqrtf()	7364	57704	7.84

e raised to the power x

while(1)
{
	c = expf(a);
	c = expf(a);
	c = expf(a);
	c = expf(a);
	cnt += 4;
}

Algorithm	FPU off (runs/10 ms)	FPU on (runs/10 ms)	Ratio (on/off)
expf()	444	14404	32.44

x raised to the power y

while(1)
{
	c = powf(a, b);
	c = powf(a, b);
	c = powf(a, b);
	c = powf(a, b);
	cnt += 4;
}

Algorithm	FPU off (runs/10 ms)	FPU on (runs/10 ms)	Ratio (on/off)
powf()	124	5308	42.8

Compound algorithms

100-point first-order low-pass filter

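#include <stdlib.h>	/* rand(), RAND_MAX */

__IO float t1[100];	/* input samples */
__IO float t2;		/* filter output; volatile so the stores are kept */

/* First-order low-pass filter: y[n] = a * x[n] + (1 - a) * y[n-1] */
static inline float first_order_filter(float new_data)
{
	static float old_data;
	float a = 0.05f;
	old_data = a * new_data + (1.0f - a) * old_data;
	return old_data;
}

int main(void)
{
	SystemClock_Config(CPU_CLK_240M);

	/* Fill the input buffer with pseudo-random values */
	for (uint32_t i = 0; i < 100; i++)
	{
		t1[i] = rand() / (float)(RAND_MAX / 0xffff);
	}
	while (1)
	{
		/* Inner loop unrolled by four */
		for (uint32_t i = 0; i < 100; i += 4)
		{
			t2 = first_order_filter(t1[i]);
			t2 = first_order_filter(t1[i + 1]);
			t2 = first_order_filter(t1[i + 2]);
			t2 = first_order_filter(t1[i + 3]);
		}
		cnt++;	/* one count per 100-point pass */
	}

	return 0;
}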

FPU off (runs/10 ms)	FPU on (runs/10 ms)	Ratio (on/off)
37	1940	52.43
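The filter in this test computes y[n] = a·x[n] + (1 − a)·y[n−1] with a = 0.05. To see what that recurrence does, the sketch below feeds it a constant input of 1.0 and returns the output after n samples, which follows 1 − (1 − a)^n. The struct and function names are hypothetical, and the state is kept in a struct rather than a static variable purely so the example is easy to reset.

```c
#include <stddef.h>

/* Hypothetical filter state: smoothing factor and previous output. */
typedef struct {
    float alpha;
    float state;
} lowpass_t;

/* One filter step: y[n] = alpha * x[n] + (1 - alpha) * y[n-1]. */
float lowpass_update(lowpass_t *f, float x)
{
    f->state = f->alpha * x + (1.0f - f->alpha) * f->state;
    return f->state;
}

/* Output after n samples of a unit step, starting from zero state. */
float lowpass_step_response(float alpha, size_t n)
{
    lowpass_t f = { alpha, 0.0f };
    float y = 0.0f;
    for (size_t i = 0; i < n; i++) {
        y = lowpass_update(&f, 1.0f);
    }
    return y;
}
```

With a = 0.05 the first output is 0.05, and after 100 samples the output has risen to about 0.994, so a smaller a gives heavier smoothing and a slower response.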

100-point sine curve

#include <math.h>	/* sinf() */

int main(void)
{
	SystemClock_Config(CPU_CLK_240M);

	/* Sine formula: y = A*sin(ωx + ψ) + k */
	__IO float W = 3.1415926f / 50.f;
	__IO float A = 100.0f;
	__IO float k = 50.0f;
	__IO float offset = 6.0f;
	__IO float wave;
	__IO float x[100] = { 0.0f };

	/* Note: temp is never advanced, so every x[i] ends up as 0.05f */
	float temp = 0.0f;
	for (uint32_t i = 0; i < 100; i++)
	{
		x[i] = temp + 0.05f;
	}

	while (1)
	{
		/* Inner loop unrolled by four */
		for (uint32_t i = 0; i < 100; i += 4)
		{
			wave = A * sinf(W * x[i] + offset) + k;
			wave = A * sinf(W * x[i + 1] + offset) + k;
			wave = A * sinf(W * x[i + 2] + offset) + k;
			wave = A * sinf(W * x[i + 3] + offset) + k;
		}
		cnt++;	/* one count per 100-point pass */
	}

	return 0;
}

FPU off (runs/10 ms)	FPU on (runs/10 ms)	Ratio (on/off)
4	116	29.0

32×32-pixel RGB to grayscale conversion

int main(void)
{
	SystemClock_Config(CPU_CLK_240M);

	__IO uint32_t cnt = 0;
	__IO uint8_t color_r[32][32];
	__IO uint8_t color_g[32][32];
	__IO uint8_t color_b[32][32];
	__IO uint8_t color_gray[32][32];

	/* Fill the three channels with pseudo-random values in 0..254 */
	for (uint32_t i = 0; i < 32; i++)
	{
		for (uint32_t j = 0; j < 32; j++)
		{
			color_r[i][j] = rand() % 0xff;
			color_g[i][j] = rand() % 0xff;
			color_b[i][j] = rand() % 0xff;
		}
	}

	while (1)
	{
		for (uint32_t i = 0; i < 32; i++)
		{
			/* Inner loop unrolled by four; weighted sum is truncated to uint8_t */
			for (uint32_t j = 0; j < 32; j += 4)
			{
				color_gray[i][j] = color_r[i][j] * 0.299f + color_g[i][j] * 0.587f + color_b[i][j] * 0.114f;
				color_gray[i][j + 1] = color_r[i][j + 1] * 0.299f + color_g[i][j + 1] * 0.587f + color_b[i][j + 1] * 0.114f;
				color_gray[i][j + 2] = color_r[i][j + 2] * 0.299f + color_g[i][j + 2] * 0.587f + color_b[i][j + 2] * 0.114f;
				color_gray[i][j + 3] = color_r[i][j + 3] * 0.299f + color_g[i][j + 3] * 0.587f + color_b[i][j + 3] * 0.114f;
			}
		}
		cnt++;	/* one count per 32×32 frame */
	}

	return 0;
}

FPU off (runs/10 ms)	FPU on (runs/10 ms)	Ratio (on/off)
2	61	30.5
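The conversion used in this test is a weighted sum of the three channels with weights 0.299, 0.587, and 0.114, and the float result is truncated when stored into a `uint8_t`. As a small per-pixel sketch of the same formula, the hypothetical helper below adds 0.5 before the cast so that the result is rounded to the nearest integer instead. This rounding is a variation for illustration, not what the benchmark code does.

```c
#include <stdint.h>

/* Hypothetical per-pixel helper: weighted RGB to gray in single
 * precision, rounded to nearest by adding 0.5f before truncation. */
uint8_t rgb_to_gray(uint8_t r, uint8_t g, uint8_t b)
{
    float gray = r * 0.299f + g * 0.587f + b * 0.114f;
    return (uint8_t)(gray + 0.5f);
}
```

All literals carry the `f` suffix, so the whole expression stays in single precision.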

Summary

The results above show that the XT804 core of the W806 is very weak at floating-point computation without the FPU, and that enabling the FPU greatly improves its single-precision performance. For both basic and compound algorithms, most tests speed up by roughly 10 to 50 times; asinf, acosf, and sqrtf gain less, about 6 to 8 times. The FPU is enabled by default in W806 projects, so it can be used directly.