OpenBLAS optimization benchmark

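So far I have tested three builds of OpenBLAS: 32-bit single-threaded, 64-bit single-threaded, and 32-bit multi-threaded.
Each test measures the total time of 300 forward passes through a network with layer sizes [360, 1024, 1024, 1024, 1024, 1024, 4375].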
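
To make the measurement concrete, here is a minimal sketch of how such a timing loop could look. It is not the original harness: the names `time_forward_ms`, `layer_naive`, and `macs_per_forward` are hypothetical, `FLOATS` is assumed to be `float`, activation functions are left out, and wall-clock time is taken with C11 `timespec_get`.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

typedef float FLOATS; /* assumption: single precision */

/* Layer widths quoted in the text. */
static const int LAYER_SIZES[] = {360, 1024, 1024, 1024, 1024, 1024, 4375};
enum { NUM_SIZES = sizeof(LAYER_SIZES) / sizeof(LAYER_SIZES[0]) };

/* Signature shared by all layer implementations being compared. */
typedef void (*layer_fn)(FLOATS *x, FLOATS *y, FLOATS *w, FLOATS *b,
                         int npre, int npos);

/* Plain C layer: y = W * x + b, with W stored row-major (npos x npre). */
void layer_naive(FLOATS *x, FLOATS *y, FLOATS *w, FLOATS *b, int npre, int npos)
{
    for (int i = 0; i < npos; i++) {
        FLOATS tmp = 0;
        for (int j = 0; j < npre; j++)
            tmp += x[j] * w[i * npre + j];
        y[i] = tmp + b[i];
    }
}

/* Multiply-accumulate operations in one forward pass. */
long macs_per_forward(void)
{
    long total = 0;
    for (int l = 0; l + 1 < NUM_SIZES; l++)
        total += (long)LAYER_SIZES[l] * LAYER_SIZES[l + 1];
    return total;
}

/* Wall-clock time in milliseconds (CPU time would add up all threads). */
static double now_ms(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
}

/* Runs `passes` forward passes with zero-filled weights and returns the
   elapsed wall-clock time in milliseconds. */
double time_forward_ms(layer_fn fn, int passes)
{
    assert(fn != NULL && passes > 0);
    FLOATS *w[NUM_SIZES - 1], *b[NUM_SIZES - 1], *act[NUM_SIZES];

    for (int l = 0; l < NUM_SIZES; l++) {
        act[l] = calloc((size_t)LAYER_SIZES[l], sizeof(FLOATS));
        assert(act[l] != NULL);
    }
    for (int l = 0; l + 1 < NUM_SIZES; l++) {
        w[l] = calloc((size_t)LAYER_SIZES[l] * LAYER_SIZES[l + 1], sizeof(FLOATS));
        b[l] = calloc((size_t)LAYER_SIZES[l + 1], sizeof(FLOATS));
        assert(w[l] != NULL && b[l] != NULL);
    }

    double start = now_ms();
    for (int p = 0; p < passes; p++)
        for (int l = 0; l + 1 < NUM_SIZES; l++)
            fn(act[l], act[l + 1], w[l], b[l], LAYER_SIZES[l], LAYER_SIZES[l + 1]);
    double elapsed = now_ms() - start;

    for (int l = 0; l + 1 < NUM_SIZES; l++) { free(w[l]); free(b[l]); }
    for (int l = 0; l < NUM_SIZES; l++) free(act[l]);
    return elapsed;
}
```

Calling `time_forward_ms(layer_naive, 300)` would time the plain C variant; any other function with the same signature can be passed in its place. With these layer sizes one forward pass works out to 9,042,944 multiply-accumulates.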

Code samples

Baseline code

void TransRes_stdc(FLOATS *x, FLOATS *y, FLOATS *w, FLOATS *b, int npre, int npos)
{
    /* y = W * x + b, with W stored row-major as npos rows of npre values. */
    FLOATS tmp = 0;
    for (int i = 0; i < npos; i++)
    {
        tmp = 0;
        for (int j = 0; j < npre; j++)
            tmp += x[j] * w[i * npre + j];
        y[i] = tmp + b[i];
    }
}

OpenMP optimization

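void TransRes_stdc(FLOATS *x, FLOATS *y, FLOATS *w, FLOATS *b, int npre, int npos)
{
    /* Each output row is independent, so the rows are split across threads. */
    #pragma omp parallel for
    for (int i = 0; i < npos; i++)
    {
        /* tmp is declared inside the loop so that every thread has its own
           copy; an accumulator shared by all threads would be a data race. */
        FLOATS tmp = 0;
        for (int j = 0; j < npre; j++)
            tmp += x[j] * w[i * npre + j];
        y[i] = tmp + b[i];
    }
}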

OpenBLAS dot-product optimization

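void TransRes_blas(FLOATS *x, FLOATS *y, FLOATS *w, FLOATS *b, int npre, int npos)
{
    /* cblas_sdot works on float, so FLOATS must be float here. */
    FLOATS tmp = 0;
    for (int i = 0; i < npos; i++)
    {
        /* Dot product of x with row i of W. */
        tmp = cblas_sdot(npre, x, 1, w + i * npre, 1);
        y[i] = tmp + b[i];
    }
}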

OpenBLAS dot-product + OpenMP optimization

void TransRes_blas(FLOATS *x, FLOATS *y, FLOATS *w, FLOATS *b, int npre, int npos)
{
    #pragma omp parallel for
    for (int i = 0; i < npos; i++)
    {
        /* tmp is local to the iteration; sharing it across threads would
           be a data race. */
        FLOATS tmp = cblas_sdot(npre, x, 1, w + i * npre, 1);
        y[i] = tmp + b[i];
    }
}

OpenBLAS matrix optimization

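void TransRes_blas(FLOATS *x, FLOATS *y, FLOATS *w, FLOATS *b, int npre, int npos)
{
    /* Start from the bias: y = b. */
    cblas_scopy(npos, b, 1, y, 1);
    /* Then y = 1 * W * x + 1 * y, with W stored row-major as npos x npre. */
    cblas_sgemv(CblasRowMajor, CblasNoTrans, npos, npre, 1.0f, w, npre, x, 1, 1.0f, y, 1);
}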
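
For readers unfamiliar with the BLAS call, the sketch below spells out in plain C what the copy-then-`sgemv` pair computes: y = alpha * A * x + beta * y with alpha = beta = 1, after y has been initialised to the bias. The names `gemv_rowmajor_ref` and `layer_ref` are hypothetical, and the reference only covers the row-major, non-transposed, unit-stride case used above; it is not OpenBLAS code.

```c
#include <assert.h>
#include <string.h>

/* Reference for y = alpha * A * x + beta * y, where A is stored row-major
   as m rows by n columns with leading dimension lda, and x and y have
   unit stride. */
void gemv_rowmajor_ref(int m, int n, float alpha, const float *a, int lda,
                       const float *x, float beta, float *y)
{
    assert(m >= 0 && n >= 0 && lda >= n);
    for (int i = 0; i < m; i++) {
        float acc = 0.0f;
        for (int j = 0; j < n; j++)
            acc += a[i * lda + j] * x[j];
        y[i] = alpha * acc + beta * y[i];
    }
}

/* Same two steps as the BLAS version: copy the bias into y, then
   accumulate W * x on top of it. */
void layer_ref(const float *x, float *y, const float *w, const float *b,
               int npre, int npos)
{
    memcpy(y, b, (size_t)npos * sizeof(float));
    gemv_rowmajor_ref(npos, npre, 1.0f, w, npre, x, 1.0f, y);
}
```

A reference like this can be useful for checking a BLAS-based layer on a small input before benchmarking it, allowing for small floating-point differences caused by a different summation order.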

Test results

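  • 32-bit multi-threaded
  1. Baseline: 6400 ms
  2. Dot-product optimization: 3900 ms
  3. OpenMP optimization: 4000 ms
  4. OpenMP + OpenBLAS dot-product optimization: 1950 ms *
  5. OpenMP matrix optimization: 1950-2100 ms
  • 32-bit single-threaded
  1. Baseline: 6400 ms
  2. Dot-product optimization: 3800 ms
  3. OpenMP optimization: 3800 ms
  4. OpenMP + OpenBLAS dot-product optimization: 1800-2000 ms (unstable)
  5. OpenMP matrix optimization: 2500 ms
  • 64-bit single-threaded
  1. Baseline: 6400 ms
  2. Dot-product optimization: 7800 ms
  3. OpenMP + OpenBLAS dot-product optimization: 4200 ms
  4. OpenMP matrix optimization: 3200 ms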
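
The totals are easier to compare as speedups over the baseline and as time per forward pass. The helpers below are a small sketch of that arithmetic; the function names `speedup` and `ms_per_pass` are hypothetical and the inputs are the figures listed above.

```c
#include <assert.h>

/* Speedup relative to the baseline; values below 1 mean a slowdown. */
double speedup(double baseline_ms, double optimized_ms)
{
    assert(baseline_ms > 0.0 && optimized_ms > 0.0);
    return baseline_ms / optimized_ms;
}

/* Average time of one forward pass, given the total for `passes` passes. */
double ms_per_pass(double total_ms, int passes)
{
    assert(passes > 0);
    return total_ms / passes;
}
```

For example, `speedup(6400.0, 1950.0)` gives roughly 3.3 for the OpenMP + OpenBLAS dot-product variant on the 32-bit multi-threaded build, `speedup(6400.0, 7800.0)` gives roughly 0.82 for the dot-product variant on the 64-bit build (slower than the baseline), and `ms_per_pass(6400.0, 300)` puts the baseline at about 21.3 ms per forward pass.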