OpenBLAS Optimization Benchmark

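So far I have tested three OpenBLAS builds: 32-bit single-threaded, 64-bit single-threaded, and 32-bit multi-threaded.
Each test measures the total time of 300 forward passes through a network with layer sizes [360, 1024, 1024, 1024, 1024, 1024, 4375].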
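The measurement described above can be sketched roughly as follows. This is not the original benchmark harness: the names `affine`, `now_ms`, and `time_forward_ms` are hypothetical, the weights and biases are dummy constants, and `FLOATS` is assumed to be `float`. Wall-clock time is used because CPU time from `clock()` would add up the time of all threads in the multi-threaded variants.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

typedef float FLOATS; /* assumption: FLOATS is single precision */

static volatile FLOATS sink; /* keeps the compiler from discarding the work */

/* Plain-C affine layer y = W x + b; W is row-major, npos rows by npre columns. */
static void affine(const FLOATS *x, FLOATS *y, const FLOATS *w,
                   const FLOATS *b, int npre, int npos)
{
    for (int i = 0; i < npos; i++) {
        FLOATS acc = 0;
        for (int j = 0; j < npre; j++)
            acc += x[j] * w[(size_t)i * npre + j];
        y[i] = acc + b[i];
    }
}

/* Wall-clock time in milliseconds (C11 timespec_get). */
static double now_ms(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

/* Runs `iters` forward passes through a dense network with layer widths
 * sizes[0..nlayers-1]. Returns elapsed milliseconds, or -1.0 if an
 * allocation fails. */
double time_forward_ms(const int *sizes, int nlayers, int iters)
{
    assert(nlayers >= 2);
    int widest = 0;
    size_t nweights = 0, nbias = 0;
    for (int l = 0; l < nlayers; l++) {
        if (sizes[l] > widest) widest = sizes[l];
        if (l > 0) {
            nweights += (size_t)sizes[l - 1] * sizes[l];
            nbias += (size_t)sizes[l];
        }
    }
    FLOATS *w = malloc(nweights * sizeof *w);
    FLOATS *b = malloc(nbias * sizeof *b);
    FLOATS *in = malloc((size_t)widest * sizeof *in);
    FLOATS *out = malloc((size_t)widest * sizeof *out);
    if (!w || !b || !in || !out) {
        free(w); free(b); free(in); free(out);
        return -1.0;
    }
    for (size_t k = 0; k < nweights; k++) w[k] = 0.001f; /* dummy weights */
    for (size_t k = 0; k < nbias; k++) b[k] = 0.01f;     /* dummy biases */

    double start = now_ms();
    for (int it = 0; it < iters; it++) {
        for (int k = 0; k < sizes[0]; k++) in[k] = 1.0f;
        const FLOATS *wl = w, *bl = b;
        FLOATS *src = in, *dst = out;
        for (int l = 1; l < nlayers; l++) {
            affine(src, dst, wl, bl, sizes[l - 1], sizes[l]);
            wl += (size_t)sizes[l - 1] * sizes[l];
            bl += sizes[l];
            FLOATS *t = src; src = dst; dst = t; /* output feeds the next layer */
        }
        sink = src[0];
    }
    double elapsed = now_ms() - start;
    free(w); free(b); free(in); free(out);
    return elapsed;
}
```

Calling `time_forward_ms` with the seven layer sizes above and `iters = 300` would mirror the setup described here; swapping `affine` for one of the kernels below would time that variant instead.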

Code samples

Baseline code

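/* Baseline: y = W * x + b, with W stored row-major (npos rows, npre columns). */
void TransRes_stdc(FLOATS *x, FLOATS *y, FLOATS *w, FLOATS *b, int npre, int npos)
{
    for (int i = 0; i < npos; i++)
    {
        FLOATS tmp = 0; /* dot product of x with row i of w */
        for (int j = 0; j < npre; j++)
            tmp += x[j] * w[i * npre + j];
        y[i] = tmp + b[i];
    }
}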

OpenMP optimization

void TransRes_stdc(FLOATS *x, FLOATS *y, FLOATS *w, FLOATS *b, int npre, int npos)
{
#pragma omp parallel for
    for (int i = 0; i < npos; i++)
    {
        /* tmp is declared inside the loop so that each thread gets its own
           copy; a tmp shared across threads would be a data race. */
        FLOATS tmp = 0;
        for (int j = 0; j < npre; j++)
            tmp += x[j] * w[i * npre + j];
        y[i] = tmp + b[i];
    }
}

OpenBLAS dot-product optimization

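#include <cblas.h>

/* cblas_sdot is the single-precision dot product, so FLOATS must be float. */
void TransRes_blas(FLOATS *x, FLOATS *y, FLOATS *w, FLOATS *b, int npre, int npos)
{
    for (int i = 0; i < npos; i++)
    {
        /* dot product of x with row i of w */
        FLOATS tmp = cblas_sdot(npre, x, 1, w + i * npre, 1);
        y[i] = tmp + b[i];
    }
}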

OpenBLAS dot product + OpenMP optimization

#include <cblas.h>

void TransRes_blas(FLOATS *x, FLOATS *y, FLOATS *w, FLOATS *b, int npre, int npos)
{
#pragma omp parallel for
    for (int i = 0; i < npos; i++)
    {
        /* tmp is local to each iteration, so threads do not share it. */
        FLOATS tmp = cblas_sdot(npre, x, 1, w + i * npre, 1);
        y[i] = tmp + b[i];
    }
}

OpenBLAS matrix optimization

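#include <cblas.h>

void TransRes_blas(FLOATS *x, FLOATS *y, FLOATS *w, FLOATS *b, int npre, int npos)
{
    /* sgemv computes y = alpha * W * x + beta * y, so preload y with the bias b. */
    cblas_scopy(npos, b, 1, y, 1);
    cblas_sgemv(CblasRowMajor, CblasNoTrans, npos, npre, 1.0f, w, npre, x, 1, 1.0f, y, 1);
}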
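To make the copy-then-sgemv pattern easier to follow, here is a plain-C reference for what the call is expected to compute. It relies only on the standard BLAS definition of GEMV (y = alpha * A * x + beta * y); `ref_gemv` and `affine_via_gemv` are hypothetical names for illustration and are not part of OpenBLAS.

```c
#include <assert.h>
#include <string.h>

/* Reference for the row-major, non-transposed, unit-stride case:
 * y = alpha * A * x + beta * y, where A has m rows, n columns and
 * leading dimension lda (elements between the starts of two rows). */
void ref_gemv(int m, int n, float alpha, const float *a, int lda,
              const float *x, float beta, float *y)
{
    assert(lda >= n);
    for (int i = 0; i < m; i++) {
        float acc = 0.0f;
        for (int j = 0; j < n; j++)
            acc += a[(size_t)i * lda + j] * x[j];
        y[i] = alpha * acc + beta * y[i];
    }
}

/* Same two-step pattern as the matrix version: copy b into y, then
 * accumulate W * x on top of it with alpha = 1 and beta = 1. */
void affine_via_gemv(const float *x, float *y, const float *w,
                     const float *b, int npre, int npos)
{
    memcpy(y, b, (size_t)npos * sizeof *y);
    ref_gemv(npos, npre, 1.0f, w, npre, x, 1.0f, y);
}
```

With beta = 1 the previous contents of y are kept and added to the product, which is why the bias has to be copied into y before the call.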

Test results

32-bit multi-threaded

Baseline: 6400ms
Dot-product optimization: 3900ms
OpenMP optimization: 4000ms
OpenMP + OpenBLAS dot-product optimization: 1950ms *
OpenBLAS matrix optimization: 1950-2100ms

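32-bit single-threaded

Baseline: 6400ms
Dot-product optimization: 3800ms
OpenMP optimization: 3800ms
OpenMP + OpenBLAS dot-product optimization: 1800-2000ms (unstable)
OpenBLAS matrix optimization: 2500ms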

64-bit single-threaded

Baseline: 6400ms
Dot-product optimization: 7800ms
OpenMP + OpenBLAS dot-product optimization: 4200ms
OpenBLAS matrix optimization: 3200ms
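
The timings are easier to compare as speedups over the baseline. The helper below is a small sketch for that arithmetic; `speedup` and `print_speedups` are hypothetical names, and the caller supplies the timings from the lists above.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup relative to the baseline: above 1 is faster, below 1 is slower. */
double speedup(double baseline_ms, double variant_ms)
{
    assert(variant_ms > 0.0);
    return baseline_ms / variant_ms;
}

/* Prints one line per variant of a build with its time and speedup. */
void print_speedups(const char *build, double baseline_ms,
                    const char *const *names, const double *times_ms, int n)
{
    printf("%s\n", build);
    for (int i = 0; i < n; i++)
        printf("  %-45s %6.0f ms  %.2fx\n",
               names[i], times_ms[i], speedup(baseline_ms, times_ms[i]));
}
```

For the 64-bit single-threaded build, for example, the matrix version gives 6400 / 3200 = 2.0x, while the dot-product version alone gives 6400 / 7800, about 0.82x, which is slower than the baseline.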