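I benchmarked three builds of OpenBLAS: 32-bit single-threaded, 64-bit single-threaded, and 32-bit multi-threaded.
Each test measures the total time of 300 forward passes through a network with layer sizes [360, 1024, 1024, 1024, 1024, 1024, 4375].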
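The measurement described above can be sketched as a small timing harness. This is only an illustration under stated assumptions: `FLOATS` is taken to be `float`, `now_ms`, `time_forward` and `layer_ref` are hypothetical names that do not come from the original test code, weights are filled with a constant because only the timing matters, and no activation function is applied because the snippets in this post do not show one. Wall-clock time is read with C11 `timespec_get`.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

typedef float FLOATS; /* assumption: FLOATS is single precision */

/* Signature shared by the TransRes_* layer functions in this post. */
typedef void (*layer_fn)(FLOATS *x, FLOATS *y, FLOATS *w, FLOATS *b, int npre, int npos);

/* Stand-in layer (hypothetical name): y = W * x + b, W stored row-major. */
void layer_ref(FLOATS *x, FLOATS *y, FLOATS *w, FLOATS *b, int npre, int npos)
{
    for (int i = 0; i < npos; i++) {
        FLOATS tmp = 0;
        for (int j = 0; j < npre; j++)
            tmp += x[j] * w[(size_t)i * npre + j];
        y[i] = tmp + b[i];
    }
}

/* Wall-clock time in milliseconds. */
double now_ms(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

/* Runs `reps` forward passes through layers of the given sizes and
   returns the elapsed time in ms, or -1.0 if allocation fails. */
double time_forward(layer_fn layer, const int *sizes, int n_sizes, int reps)
{
    assert(n_sizes >= 2 && reps > 0);
    size_t n_w = 0, n_b = 0;
    int widest = sizes[0];
    for (int l = 1; l < n_sizes; l++) {
        n_w += (size_t)sizes[l - 1] * sizes[l];
        n_b += (size_t)sizes[l];
        if (sizes[l] > widest) widest = sizes[l];
    }
    FLOATS *w = malloc(n_w * sizeof *w);
    FLOATS *b = malloc(n_b * sizeof *b);
    FLOATS *in = malloc((size_t)widest * sizeof *in);
    FLOATS *out = malloc((size_t)widest * sizeof *out);
    double elapsed = -1.0;
    if (w && b && in && out) {
        for (size_t i = 0; i < n_w; i++) w[i] = 0.001f;
        for (size_t i = 0; i < n_b; i++) b[i] = 0.0f;
        double t0 = now_ms();
        for (int r = 0; r < reps; r++) {
            for (int i = 0; i < sizes[0]; i++) in[i] = 1.0f;
            FLOATS *wl = w, *bl = b;
            for (int l = 1; l < n_sizes; l++) {
                layer(in, out, wl, bl, sizes[l - 1], sizes[l]);
                wl += (size_t)sizes[l - 1] * sizes[l]; /* next layer's weights */
                bl += sizes[l];                        /* next layer's biases  */
                FLOATS *t = in; in = out; out = t;     /* output feeds next layer */
            }
        }
        elapsed = now_ms() - t0;
    }
    free(w); free(b); free(in); free(out);
    return elapsed;
}
```

With `int sizes[] = {360, 1024, 1024, 1024, 1024, 1024, 4375};`, a call such as `time_forward(layer_ref, sizes, 7, 300)` would correspond to one row of the results; any of the layer variants can be passed in place of `layer_ref`.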
Code samples
Baseline code
    /* y = W * x + b, with W stored row-major (npos rows, npre columns). */
    void TransRes_stdc(FLOATS *x, FLOATS *y, FLOATS *w, FLOATS *b, int npre, int npos)
    {
        FLOATS tmp = 0;
        for(int i = 0; i < npos; i++)
        {
            tmp = 0;
            for(int j = 0; j < npre; j++)
                tmp += x[j] * w[i * npre + j];
            y[i] = tmp + b[i];
        }
    }
OpenMP optimization
    #include <omp.h>

    void TransRes_stdc(FLOATS *x, FLOATS *y, FLOATS *w, FLOATS *b, int npre, int npos)
    {
        FLOATS tmp = 0;
        /* Output rows are independent; each thread needs its own accumulator. */
        #pragma omp parallel for private(tmp)
        for(int i = 0; i < npos; i++)
        {
            tmp = 0;
            for(int j = 0; j < npre; j++)
                tmp += x[j] * w[i * npre + j];
            y[i] = tmp + b[i];
        }
    }
OpenBLAS dot-product optimization
    #include <cblas.h>

    void TransRes_blas(FLOATS *x, FLOATS *y, FLOATS *w, FLOATS *b, int npre, int npos)
    {
        FLOATS tmp = 0;
        for(int i = 0; i < npos; i++)
        {
            /* Dot product of x with row i of w. */
            tmp = cblas_sdot(npre, x, 1, w + i * npre, 1);
            y[i] = tmp + b[i];
        }
    }
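OpenBLAS dot-product + OpenMP optimization
    #include <cblas.h>
    #include <omp.h>

    void TransRes_blas(FLOATS *x, FLOATS *y, FLOATS *w, FLOATS *b, int npre, int npos)
    {
        FLOATS tmp = 0;
        /* One cblas_sdot call per output row, rows split across threads. */
        #pragma omp parallel for private(tmp)
        for(int i = 0; i < npos; i++)
        {
            tmp = cblas_sdot(npre, x, 1, w + i * npre, 1);
            y[i] = tmp + b[i];
        }
    }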
OpenBLAS matrix optimization
    #include <cblas.h>

    void TransRes_blas(FLOATS *x, FLOATS *y, FLOATS *w, FLOATS *b, int npre, int npos)
    {
        /* Copy the bias into y, then compute y = 1 * W * x + 1 * y in one call. */
        cblas_scopy(npos, b, 1, y, 1);
        cblas_sgemv(CblasRowMajor, CblasNoTrans, npos, npre, 1.0f, w, npre, x, 1, 1.0f, y, 1);
    }
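Because the variants sum the products in different orders, their single-precision outputs can differ in the last few bits, so it is worth checking that they agree before comparing timings. Below is a minimal sketch of such a check. It assumes `FLOATS` is `float`; `max_abs_diff`, `layer_ref` and `layer_unrolled` are hypothetical names. `layer_unrolled` uses four partial sums to reorder the additions, roughly as a vectorized kernel typically would, and serves as a stand-in so that the example builds without OpenBLAS.

```c
#include <assert.h>
#include <stdlib.h>

typedef float FLOATS; /* assumption: FLOATS is single precision */
typedef void (*layer_fn)(FLOATS *x, FLOATS *y, FLOATS *w, FLOATS *b, int npre, int npos);

/* Plain-C reference: y = W * x + b, W stored row-major. */
void layer_ref(FLOATS *x, FLOATS *y, FLOATS *w, FLOATS *b, int npre, int npos)
{
    for (int i = 0; i < npos; i++) {
        FLOATS tmp = 0;
        for (int j = 0; j < npre; j++)
            tmp += x[j] * w[(size_t)i * npre + j];
        y[i] = tmp + b[i];
    }
}

/* Same result mathematically, but with four partial sums per row,
   which changes the floating-point summation order. */
void layer_unrolled(FLOATS *x, FLOATS *y, FLOATS *w, FLOATS *b, int npre, int npos)
{
    for (int i = 0; i < npos; i++) {
        const FLOATS *row = w + (size_t)i * npre;
        FLOATS s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int j = 0;
        for (; j + 3 < npre; j += 4) {
            s0 += x[j] * row[j];
            s1 += x[j + 1] * row[j + 1];
            s2 += x[j + 2] * row[j + 2];
            s3 += x[j + 3] * row[j + 3];
        }
        for (; j < npre; j++) /* leftover elements */
            s0 += x[j] * row[j];
        y[i] = (s0 + s1) + (s2 + s3) + b[i];
    }
}

/* Runs both layers on the same pseudo-random data and returns the largest
   absolute difference between their outputs, or -1.0 if allocation fails. */
double max_abs_diff(layer_fn fa, layer_fn fb, int npre, int npos)
{
    assert(npre > 0 && npos > 0);
    FLOATS *x = malloc((size_t)npre * sizeof *x);
    FLOATS *w = malloc((size_t)npre * npos * sizeof *w);
    FLOATS *bias = malloc((size_t)npos * sizeof *bias);
    FLOATS *ya = malloc((size_t)npos * sizeof *ya);
    FLOATS *yb = malloc((size_t)npos * sizeof *yb);
    double worst = -1.0;
    if (x && w && bias && ya && yb) {
        srand(12345); /* fixed seed so the check is repeatable */
        for (int j = 0; j < npre; j++) x[j] = rand() / (FLOATS)RAND_MAX - 0.5f;
        for (size_t k = 0; k < (size_t)npre * npos; k++) w[k] = rand() / (FLOATS)RAND_MAX - 0.5f;
        for (int i = 0; i < npos; i++) bias[i] = rand() / (FLOATS)RAND_MAX - 0.5f;
        fa(x, ya, w, bias, npre, npos);
        fb(x, yb, w, bias, npre, npos);
        worst = 0.0;
        for (int i = 0; i < npos; i++) {
            double d = (double)ya[i] - (double)yb[i];
            if (d < 0) d = -d;
            if (d > worst) worst = d;
        }
    }
    free(x); free(w); free(bias); free(ya); free(yb);
    return worst;
}
```

Passing `TransRes_stdc` and a `TransRes_blas` variant as the two function pointers would compare the actual implementations; a tolerance around `1e-4` is an assumption that should be adjusted to the magnitude of the real weights.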
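Test results
- 32-bit multi-threaded
  - Baseline: 6400 ms
  - Dot-product optimization: 3900 ms
  - OpenMP optimization: 4000 ms
  - OpenMP + OpenBLAS dot-product optimization: 1950 ms *
  - OpenBLAS matrix optimization: 1950-2100 ms
- 32-bit single-threaded
  - Baseline: 6400 ms
  - Dot-product optimization: 3800 ms
  - OpenMP optimization: 3800 ms
  - OpenMP + OpenBLAS dot-product optimization: 1800-2000 ms (varies between runs)
  - OpenBLAS matrix optimization: 2500 ms
- 64-bit single-threaded
  - Baseline: 6400 ms
  - Dot-product optimization: 7800 ms
  - OpenMP + OpenBLAS dot-product optimization: 4200 ms
  - OpenBLAS matrix optimization: 3200 ms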
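To put these timings in context, they can be converted into arithmetic throughput. The sketch below counts only the multiply-adds in the weight matrices (bias additions are ignored) and treats each multiply-add as two floating-point operations; `count_macs` and `gflops` are hypothetical helper names. For the layer sizes used here, one forward pass is 9,042,944 multiply-adds, so 300 passes in 6400 ms is roughly 0.85 GFLOP/s and 300 passes in 1950 ms is roughly 2.8 GFLOP/s.

```c
#include <assert.h>

/* Multiply-adds in one forward pass: sum of npre * npos over all layers. */
long long count_macs(const int *sizes, int n_sizes)
{
    assert(n_sizes >= 2);
    long long macs = 0;
    for (int l = 1; l < n_sizes; l++)
        macs += (long long)sizes[l - 1] * sizes[l];
    return macs;
}

/* Throughput in GFLOP/s, counting one multiply-add as 2 operations.
   elapsed_ms is the total time for `passes` forward passes. */
double gflops(long long macs_per_pass, int passes, double elapsed_ms)
{
    assert(elapsed_ms > 0.0);
    return 2.0 * (double)macs_per_pass * passes / (elapsed_ms * 1e6);
}
```

For example, `gflops(count_macs(sizes, 7), 300, 6400.0)` with `sizes` set to the network above gives the baseline figure.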