This is my write-up of another CSAPP lab, perflab, which I worked through a while ago, meow~

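The lab asks us to optimize two functions in kernel.c from the handout. rotate rotates an image 90° counterclockwise. smooth blurs an image by replacing every pixel with the average of itself and its neighboring pixels. Let me optimize them one at a time ^_^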

1. Optimizing the rotate function

The original rotate function looks like this:

/* 
 * naive_rotate - The naive baseline version of rotate 
 */
char naive_rotate_descr[] = "naive_rotate: Naive baseline implementation";
void naive_rotate(int dim, pixel *src, pixel *dst) 
{
    int i, j;

    for (i = 0; i < dim; i++)
    for (j = 0; j < dim; j++)
        dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}

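The code is easy to follow: a double loop moves each pixel to its new position. A quick look suggests a very simple optimization. In the innermost loop j changes on every iteration, so dim-1-j is recomputed for every single assignment, and that adds up. With a little algebra we can rewrite the assignment as dst[RIDX(i, j, dim)] = src[RIDX(j, dim-i-1, dim)]; so that dim-i-1 only changes in the outer loop. The rewrite also makes the writes to dst sequential in memory, which is probably where much of the gain comes from. The optimized code is as follows: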

char naive_rotate_descr2[] = "naive_rotate2: only change the place of i and j";
void naive_rotate2(int dim, pixel *src, pixel *dst) 
{
    int i, j;
    for (i = 0; i < dim; i++)
    for (j = 0; j < dim; j++)
        dst[RIDX(i, j, dim)] = src[RIDX(j, dim-i-1, dim)]; // dim-i-1 changes only in the outer loop
}

The test results are below. The speed improved considerably:

lab4-1
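As a sanity check on the algebra, here is a minimal sketch that runs both index orders on the same test image and compares the outputs. The pixel struct and the RIDX macro are my stand-ins for the handout's definitions, and fill, same and rotate_versions_agree are hypothetical helper names, not part of the lab.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed stand-ins for the handout's definitions; the real ones may differ. */
typedef struct { unsigned short red, green, blue; } pixel;
#define RIDX(i, j, n) ((i) * (n) + (j))

/* Baseline order: walk src row by row, scatter into dst. */
static void rotate_src_order(int dim, const pixel *src, pixel *dst)
{
    for (int i = 0; i < dim; i++)
        for (int j = 0; j < dim; j++)
            dst[RIDX(dim - 1 - j, i, dim)] = src[RIDX(i, j, dim)];
}

/* Rewritten order: walk dst row by row, gather from src. */
static void rotate_dst_order(int dim, const pixel *src, pixel *dst)
{
    for (int i = 0; i < dim; i++)
        for (int j = 0; j < dim; j++)
            dst[RIDX(i, j, dim)] = src[RIDX(j, dim - i - 1, dim)];
}

/* Hypothetical helper: give every pixel a position-dependent value. */
static void fill(int dim, pixel *img)
{
    for (int k = 0; k < dim * dim; k++) {
        img[k].red = (unsigned short)(k % 251);
        img[k].green = (unsigned short)(k % 241);
        img[k].blue = (unsigned short)(k % 239);
    }
}

/* Hypothetical helper: 1 if the two images are identical, else 0. */
static int same(int dim, const pixel *a, const pixel *b)
{
    for (int k = 0; k < dim * dim; k++)
        if (a[k].red != b[k].red || a[k].green != b[k].green || a[k].blue != b[k].blue)
            return 0;
    return 1;
}

/* Returns 1 if both loop orders produce the same rotated image. */
int rotate_versions_agree(int dim)
{
    assert(dim > 0);
    size_t n = (size_t)dim * dim;
    pixel *src = malloc(n * sizeof *src);
    pixel *a = malloc(n * sizeof *a);
    pixel *b = malloc(n * sizeof *b);
    int ok = 0;
    if (src && a && b) {
        fill(dim, src);
        rotate_src_order(dim, src, a);
        rotate_dst_order(dim, src, b);
        ok = same(dim, a, b);
    }
    free(src); free(a); free(b);
    return ok;
}
```

Calling rotate_versions_agree(96) from a small main is enough to confirm that swapping the indices does not change the result.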

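Swapping the indices is about the simplest optimization available. Looking again at the conditions we are given, the lab handout says that, to make life easier, the dimension N of every test image is assumed to be a multiple of 32. So we can split the image into 32x32 blocks and transform one block at a time to speed things up further. (I later realized that without this assumption the optimization would be a real nightmare, or maybe I'm just not smart enough TAT)

Here is the optimized code: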

char rotate_block_descr[] = "rotate block: change 32*32 block each time";
void rotate_block(int dim, pixel *src, pixel *dst) 
{
    int i,j,i1,j1=0;
    //32*32 block size
    int block_size=32;
    for(i=0;i<dim;i+=block_size)
        for(j=0;j<dim;j+=block_size){
            for(i1=i;i1<i+block_size;i1++)
                for(j1=j;j1<j+block_size;j1++)
                    dst[RIDX(i1, j1, dim)] = src[RIDX(j1, dim-i1-1, dim)]; // dim-i1-1 changes only in the i1 loop
        }
}

Here are the test results. Performance improved again:

lab4-2
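To experiment with other block sizes, the blocked traversal can be written with the block size as a parameter. This is only a sketch under my own assumptions: the pixel struct and RIDX are stand-ins for the handout's definitions, rotate_blocked and blocked_matches_naive are hypothetical names, and dim must be a multiple of block.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed stand-ins for the handout's definitions; the real ones may differ. */
typedef struct { unsigned short red, green, blue; } pixel;
#define RIDX(i, j, n) ((i) * (n) + (j))

/* Reference: the unblocked rotation from the baseline code. */
static void rotate_naive(int dim, const pixel *src, pixel *dst)
{
    for (int i = 0; i < dim; i++)
        for (int j = 0; j < dim; j++)
            dst[RIDX(dim - 1 - j, i, dim)] = src[RIDX(i, j, dim)];
}

/* Blocked rotation: finish one block x block tile of dst before moving on. */
static void rotate_blocked(int dim, int block, const pixel *src, pixel *dst)
{
    assert(block > 0 && dim % block == 0);
    for (int i = 0; i < dim; i += block)
        for (int j = 0; j < dim; j += block)
            for (int i1 = i; i1 < i + block; i1++)
                for (int j1 = j; j1 < j + block; j1++)
                    dst[RIDX(i1, j1, dim)] = src[RIDX(j1, dim - i1 - 1, dim)];
}

/* Returns 1 if the blocked version reproduces the naive result exactly. */
int blocked_matches_naive(int dim, int block)
{
    size_t n = (size_t)dim * dim;
    pixel *src = malloc(n * sizeof *src);
    pixel *a = malloc(n * sizeof *a);
    pixel *b = malloc(n * sizeof *b);
    int ok = 0;
    if (src && a && b) {
        for (size_t k = 0; k < n; k++) {
            src[k].red = (unsigned short)(k % 251);
            src[k].green = (unsigned short)(k % 241);
            src[k].blue = (unsigned short)(k % 239);
        }
        rotate_naive(dim, src, a);
        rotate_blocked(dim, block, src, b);
        ok = 1;
        for (size_t k = 0; k < n; k++)
            if (a[k].red != b[k].red || a[k].green != b[k].green || a[k].blue != b[k].blue)
                ok = 0;
    }
    free(src); free(a); free(b);
    return ok;
}
```

Which block size is fastest depends on the cache of the machine running the tests, so values such as 8, 16 and 32 are worth timing rather than guessing.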

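But I wasn't satisfied yet. The CSAPP book mentions parallelism techniques (loop unrolling) that make fuller use of the processor's pipeline to speed a program up, and this function can be unrolled 32 ways. Working through the index math gives the code below. (The statement dst += (dim - 1) * dim; moves dst to the start of the last row, because column j of src ends up in row dim-1-j of dst, and the loop then walks upward with dst -= 31; dst -= dim;. Without that initial offset dst could end up pointing at addresses we don't want, perhaps even into the source image.)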

char rotate_descr[]= "rotate: Current working version";
void rotate(int dim, pixel *src, pixel *dst){
    int i,j=0;
    dst += (dim-1)*dim; // start at the first pixel of the last row of dst
    for(i=0;i<dim;i+=32){
        for(j=0;j<dim;j++){

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            dst++;src+=dim;

            *dst=*src;
            //dst++;src+=dim;

            src-=31*dim;src++;
            dst-=31;dst-=dim;
        }
        src+=31*dim;
        dst+=32;dst+=dim*dim;
    }
}

The test results are below. Performance went up once more:

lab4-3
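The pointer arithmetic in the unrolled version is easier to follow when the 32 copies are written as an inner loop. The sketch below is not the code that was timed: it computes the two starting addresses explicitly instead of stepping the pointers, and leaves any unrolling to the compiler. STRIP, rotate_strips and strips_match_naive are hypothetical names, and pixel and RIDX are stand-ins for the handout's definitions.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed stand-ins for the handout's definitions; the real ones may differ. */
typedef struct { unsigned short red, green, blue; } pixel;
#define RIDX(i, j, n) ((i) * (n) + (j))

enum { STRIP = 32 }; /* number of copies per inner step, as in the unrolled code */

/* Reference: the baseline rotation. */
static void rotate_naive(int dim, const pixel *src, pixel *dst)
{
    for (int i = 0; i < dim; i++)
        for (int j = 0; j < dim; j++)
            dst[RIDX(dim - 1 - j, i, dim)] = src[RIDX(i, j, dim)];
}

/* For each column j of src, copy STRIP vertically adjacent src pixels
 * into STRIP horizontally adjacent dst pixels. */
static void rotate_strips(int dim, const pixel *src, pixel *dst)
{
    assert(dim % STRIP == 0);
    for (int i = 0; i < dim; i += STRIP) {
        for (int j = 0; j < dim; j++) {
            const pixel *s = src + RIDX(i, j, dim);     /* src[i][j]       */
            pixel *d = dst + RIDX(dim - 1 - j, i, dim); /* dst[dim-1-j][i] */
            for (int k = 0; k < STRIP; k++)
                d[k] = s[k * dim]; /* dst[dim-1-j][i+k] = src[i+k][j] */
        }
    }
}

/* Returns 1 if the strip version reproduces the naive result exactly. */
int strips_match_naive(int dim)
{
    size_t n = (size_t)dim * dim;
    pixel *src = malloc(n * sizeof *src);
    pixel *a = malloc(n * sizeof *a);
    pixel *b = malloc(n * sizeof *b);
    int ok = 0;
    if (src && a && b) {
        for (size_t k = 0; k < n; k++) {
            src[k].red = (unsigned short)(k % 251);
            src[k].green = (unsigned short)(k % 241);
            src[k].blue = (unsigned short)(k % 239);
        }
        rotate_naive(dim, src, a);
        rotate_strips(dim, src, b);
        ok = 1;
        for (size_t k = 0; k < n; k++)
            if (a[k].red != b[k].red || a[k].green != b[k].green || a[k].blue != b[k].blue)
                ok = 0;
    }
    free(src); free(a); free(b);
    return ok;
}
```

Reads from src stride by dim while writes to dst are contiguous, which is the same access pattern as the hand-unrolled version.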

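There may well be better optimizations, but I can't think of any more! If you have a better approach, feel free to leave a comment O(∩_∩)O

2. Optimizing the smooth function

The original code of the smooth function is as follows: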

/*
 * naive_smooth - The naive baseline version of smooth 
 */
char naive_smooth_descr[] = "naive_smooth: Naive baseline implementation";
void naive_smooth(int dim, pixel *src, pixel *dst) 
{
    int i, j;

    for (i = 0; i < dim; i++)
    for (j = 0; j < dim; j++)
        dst[RIDX(i, j, dim)] = avg(dim, i, j, src);
}

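This code is also easy to follow, but it has a serious problem: every iteration calls avg, and avg in turn calls other functions. These frequent function calls dominate the running time, so the first step is to stop calling avg() from smooth. We compute the average ourselves, which takes three cases: the four corners of the image, the pixels along the four edges, and the general (interior) case.

Here is the optimized code: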

char smooth0_descr[] = "smooth0: test version";
void smooth0(int dim, pixel *src, pixel *dst) 
{
    int i,j;
    // avg() is no longer called; the averages are computed inline

    //corners
    dst[RIDX(0,0,dim)].red=(src[RIDX(0,0,dim)].red+src[RIDX(1,0,dim)].red+src[RIDX(0,1,dim)].red+src[RIDX(1,1,dim)].red)>>2;
    dst[RIDX(0,0,dim)].blue=(src[RIDX(0,0,dim)].blue+src[RIDX(1,0,dim)].blue+src[RIDX(0,1,dim)].blue+src[RIDX(1,1,dim)].blue)>>2;
    dst[RIDX(0,0,dim)].green=(src[RIDX(0,0,dim)].green+src[RIDX(1,0,dim)].green+src[RIDX(0,1,dim)].green+src[RIDX(1,1,dim)].green)>>2;

    dst[RIDX(0,dim-1,dim)].red=(src[RIDX(0,dim-1,dim)].red+src[RIDX(1,dim-1,dim)].red+src[RIDX(0,dim-2,dim)].red+src[RIDX(1,dim-2,dim)].red)>>2;
    dst[RIDX(0,dim-1,dim)].blue=(src[RIDX(0,dim-1,dim)].blue+src[RIDX(1,dim-1,dim)].blue+src[RIDX(0,dim-2,dim)].blue+src[RIDX(1,dim-2,dim)].blue)>>2;
    dst[RIDX(0,dim-1,dim)].green=(src[RIDX(0,dim-1,dim)].green+src[RIDX(1,dim-1,dim)].green+src[RIDX(0,dim-2,dim)].green+src[RIDX(1,dim-2,dim)].green)>>2;

    dst[RIDX(dim-1,0,dim)].red=(src[RIDX(dim-1,0,dim)].red+src[RIDX(dim-2,0,dim)].red+src[RIDX(dim-1,1,dim)].red+src[RIDX(dim-2,1,dim)].red)>>2;
    dst[RIDX(dim-1,0,dim)].blue=(src[RIDX(dim-1,0,dim)].blue+src[RIDX(dim-2,0,dim)].blue+src[RIDX(dim-1,1,dim)].blue+src[RIDX(dim-2,1,dim)].blue)>>2;
    dst[RIDX(dim-1,0,dim)].green=(src[RIDX(dim-1,0,dim)].green+src[RIDX(dim-2,0,dim)].green+src[RIDX(dim-1,1,dim)].green+src[RIDX(dim-2,1,dim)].green)>>2;

    dst[RIDX(dim-1,dim-1,dim)].red=(src[RIDX(dim-1,dim-1,dim)].red+src[RIDX(dim-1,dim-2,dim)].red+src[RIDX(dim-2,dim-1,dim)].red+src[RIDX(dim-2,dim-2,dim)].red)>>2;
    dst[RIDX(dim-1,dim-1,dim)].blue=(src[RIDX(dim-1,dim-1,dim)].blue+src[RIDX(dim-1,dim-2,dim)].blue+src[RIDX(dim-2,dim-1,dim)].blue+src[RIDX(dim-2,dim-2,dim)].blue)>>2;
    dst[RIDX(dim-1,dim-1,dim)].green=(src[RIDX(dim-1,dim-1,dim)].green+src[RIDX(dim-1,dim-2,dim)].green+src[RIDX(dim-2,dim-1,dim)].green+src[RIDX(dim-2,dim-2,dim)].green)>>2;

    // border
    for(i=1;i<dim-1;i++){
        dst[RIDX(i,0,dim)].red=(src[RIDX(i,0,dim)].red+src[RIDX(i-1,0,dim)].red+src[RIDX(i-1,1,dim)].red+src[RIDX(i,1,dim)].red+src[RIDX(i+1,0,dim)].red+src[RIDX(i+1,1,dim)].red)/6;
        dst[RIDX(i,0,dim)].blue=(src[RIDX(i,0,dim)].blue+src[RIDX(i-1,0,dim)].blue+src[RIDX(i-1,1,dim)].blue+src[RIDX(i,1,dim)].blue+src[RIDX(i+1,0,dim)].blue+src[RIDX(i+1,1,dim)].blue)/6;
        dst[RIDX(i,0,dim)].green=(src[RIDX(i,0,dim)].green+src[RIDX(i-1,0,dim)].green+src[RIDX(i-1,1,dim)].green+src[RIDX(i,1,dim)].green+src[RIDX(i+1,0,dim)].green+src[RIDX(i+1,1,dim)].green)/6;
    }

    for(i=1;i<dim-1;i++){
        dst[RIDX(i,dim-1,dim)].red=(src[RIDX(i,dim-1,dim)].red+src[RIDX(i-1,dim-1,dim)].red+src[RIDX(i-1,dim-2,dim)].red+src[RIDX(i,dim-2,dim)].red+src[RIDX(i+1,dim-1,dim)].red+src[RIDX(i+1,dim-2,dim)].red)/6;
        dst[RIDX(i,dim-1,dim)].blue=(src[RIDX(i,dim-1,dim)].blue+src[RIDX(i-1,dim-1,dim)].blue+src[RIDX(i-1,dim-2,dim)].blue+src[RIDX(i,dim-2,dim)].blue+src[RIDX(i+1,dim-1,dim)].blue+src[RIDX(i+1,dim-2,dim)].blue)/6;
        dst[RIDX(i,dim-1,dim)].green=(src[RIDX(i,dim-1,dim)].green+src[RIDX(i-1,dim-1,dim)].green+src[RIDX(i-1,dim-2,dim)].green+src[RIDX(i,dim-2,dim)].green+src[RIDX(i+1,dim-1,dim)].green+src[RIDX(i+1,dim-2,dim)].green)/6;
    }

    for(j=1;j<dim-1;j++){
        dst[RIDX(0,j,dim)].red=(src[RIDX(0,j,dim)].red+src[RIDX(0,j-1,dim)].red+src[RIDX(1,j-1,dim)].red+src[RIDX(1,j,dim)].red+src[RIDX(0,j+1,dim)].red+src[RIDX(1,j+1,dim)].red)/6;
        dst[RIDX(0,j,dim)].blue=(src[RIDX(0,j,dim)].blue+src[RIDX(0,j-1,dim)].blue+src[RIDX(1,j-1,dim)].blue+src[RIDX(1,j,dim)].blue+src[RIDX(0,j+1,dim)].blue+src[RIDX(1,j+1,dim)].blue)/6;
        dst[RIDX(0,j,dim)].green=(src[RIDX(0,j,dim)].green+src[RIDX(0,j-1,dim)].green+src[RIDX(1,j-1,dim)].green+src[RIDX(1,j,dim)].green+src[RIDX(0,j+1,dim)].green+src[RIDX(1,j+1,dim)].green)/6;
    }

    for(j=1;j<dim-1;j++){
        dst[RIDX(dim-1,j,dim)].red=(src[RIDX(dim-1,j,dim)].red+src[RIDX(dim-1,j+1,dim)].red+src[RIDX(dim-1,j-1,dim)].red+src[RIDX(dim-2,j,dim)].red+src[RIDX(dim-2,j+1,dim)].red+src[RIDX(dim-2,j-1,dim)].red)/6;
        dst[RIDX(dim-1,j,dim)].blue=(src[RIDX(dim-1,j,dim)].blue+src[RIDX(dim-1,j+1,dim)].blue+src[RIDX(dim-1,j-1,dim)].blue+src[RIDX(dim-2,j,dim)].blue+src[RIDX(dim-2,j+1,dim)].blue+src[RIDX(dim-2,j-1,dim)].blue)/6;
        dst[RIDX(dim-1,j,dim)].green=(src[RIDX(dim-1,j,dim)].green+src[RIDX(dim-1,j+1,dim)].green+src[RIDX(dim-1,j-1,dim)].green+src[RIDX(dim-2,j,dim)].green+src[RIDX(dim-2,j+1,dim)].green+src[RIDX(dim-2,j-1,dim)].green)/6;
    }

    //common
    for(i=1;i<dim-1;i++)
        for(j=1;j<dim-1;j++){
            dst[RIDX(i,j,dim)].red=(src[RIDX(i,j,dim)].red+src[RIDX(i+1,j,dim)].red+src[RIDX(i-1,j,dim)].red+src[RIDX(i,j-1,dim)].red+src[RIDX(i+1,j-1,dim)].red+src[RIDX(i-1,j-1,dim)].red+src[RIDX(i,j+1,dim)].red+src[RIDX(i+1,j+1,dim)].red+src[RIDX(i-1,j+1,dim)].red)/9;
            dst[RIDX(i,j,dim)].blue=(src[RIDX(i,j,dim)].blue+src[RIDX(i+1,j,dim)].blue+src[RIDX(i-1,j,dim)].blue+src[RIDX(i,j-1,dim)].blue+src[RIDX(i+1,j-1,dim)].blue+src[RIDX(i-1,j-1,dim)].blue+src[RIDX(i,j+1,dim)].blue+src[RIDX(i+1,j+1,dim)].blue+src[RIDX(i-1,j+1,dim)].blue)/9;
            dst[RIDX(i,j,dim)].green=(src[RIDX(i,j,dim)].green+src[RIDX(i+1,j,dim)].green+src[RIDX(i-1,j,dim)].green+src[RIDX(i,j-1,dim)].green+src[RIDX(i+1,j-1,dim)].green+src[RIDX(i-1,j-1,dim)].green+src[RIDX(i,j+1,dim)].green+src[RIDX(i+1,j+1,dim)].green+src[RIDX(i-1,j+1,dim)].green)/9;
        }
}

The test results are below. It is quite a bit faster:

lab4-4
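The three cases come from how many pixels remain when the 3x3 window is clipped to the image: 4 at a corner, 6 on an edge, 9 in the interior. Dividing by 4 with >>2 is safe because the sums are non-negative. The sketch below is a slow reference I would use to spot-check the hand-written cases. It is my own reconstruction, not the handout's avg, and window_count and clamped_avg are hypothetical names.

```c
#include <assert.h>

/* Assumed stand-ins for the handout's definitions; the real ones may differ. */
typedef struct { unsigned short red, green, blue; } pixel;
#define RIDX(i, j, n) ((i) * (n) + (j))

/* Number of pixels in the 3x3 window around (i, j) after clipping to the image. */
int window_count(int dim, int i, int j)
{
    assert(dim >= 2 && i >= 0 && i < dim && j >= 0 && j < dim);
    int r0 = i > 0 ? i - 1 : 0, r1 = i < dim - 1 ? i + 1 : dim - 1;
    int c0 = j > 0 ? j - 1 : 0, c1 = j < dim - 1 ? j + 1 : dim - 1;
    return (r1 - r0 + 1) * (c1 - c0 + 1);
}

/* Reference average over the clipped window; handles all three cases uniformly. */
pixel clamped_avg(int dim, int i, int j, const pixel *src)
{
    int r0 = i > 0 ? i - 1 : 0, r1 = i < dim - 1 ? i + 1 : dim - 1;
    int c0 = j > 0 ? j - 1 : 0, c1 = j < dim - 1 ? j + 1 : dim - 1;
    int red = 0, green = 0, blue = 0, n = 0;

    for (int r = r0; r <= r1; r++)
        for (int c = c0; c <= c1; c++) {
            red += src[RIDX(r, c, dim)].red;
            green += src[RIDX(r, c, dim)].green;
            blue += src[RIDX(r, c, dim)].blue;
            n++;
        }

    pixel out;
    out.red = (unsigned short)(red / n);
    out.green = (unsigned short)(green / n);
    out.blue = (unsigned short)(blue / n);
    return out;
}
```

Comparing dst[RIDX(i, j, dim)] from the optimized function against clamped_avg(dim, i, j, src) at a corner, on each edge and somewhere in the interior catches most index slips in the long hand-written expressions.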

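Next I figured I could raise the parallelism the same way as in rotate. But because the borders use a different averaging formula, only the general case can be unrolled, and the interior has 32n-2 rows, which I can't see being a multiple of anything in particular beyond 2 (32n-2 = 2(16n-1)). To be safe I only unrolled two ways, and I tweaked a few other places along the way.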
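Since 32n-2 = 2(16n-1) is always even, a stride of two over the interior rows should neither skip a row nor spill onto the bottom border. Here is a small sketch that checks this for a given dim by counting how often each row would be handled. two_way_rows_cover_interior is a hypothetical helper name, and it only models the row indices, not the pixel math.

```c
#include <assert.h>
#include <stdlib.h>

/* Models the loop `for (i = 1; i < dim-1; i += 2)` whose body handles row i
 * and then row i+1. Returns 1 if every interior row 1..dim-2 is handled
 * exactly once and the border rows 0 and dim-1 are never touched. */
int two_way_rows_cover_interior(int dim)
{
    assert(dim >= 3);
    int *hits = calloc((size_t)dim, sizeof *hits);
    if (!hits)
        return 0;

    for (int i = 1; i < dim - 1; i += 2) {
        hits[i]++;     /* first copy of the body: row i */
        hits[i + 1]++; /* second copy, after pos += dim: row i+1 */
    }

    int ok = hits[0] == 0 && hits[dim - 1] == 0;
    for (int r = 1; r < dim - 1; r++)
        if (hits[r] != 1)
            ok = 0;

    free(hits);
    return ok;
}
```

For an odd dim the second copy of the body would land on the bottom border row, so the two-way version relies on the handout's guarantee that dim is a multiple of 32.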

The optimized code is as follows:

char smooth_descr[] = "smooth: Current working version";
void smooth(int dim, pixel *src, pixel *dst) 
{
    int i,j;
    // avg() is no longer called; the averages are computed inline

    //corners
    dst[0].red=(src[0].red+src[dim].red+src[1].red+src[dim+1].red)>>2;
    dst[0].blue=(src[0].blue+src[dim].blue+src[1].blue+src[dim+1].blue)>>2;
    dst[0].green=(src[0].green+src[dim].green+src[1].green+src[dim+1].green)>>2;

    dst[RIDX(0,dim-1,dim)].red=(src[RIDX(0,dim-1,dim)].red+src[RIDX(1,dim-1,dim)].red+src[RIDX(0,dim-2,dim)].red+src[RIDX(1,dim-2,dim)].red)>>2;
    dst[RIDX(0,dim-1,dim)].blue=(src[RIDX(0,dim-1,dim)].blue+src[RIDX(1,dim-1,dim)].blue+src[RIDX(0,dim-2,dim)].blue+src[RIDX(1,dim-2,dim)].blue)>>2;
    dst[RIDX(0,dim-1,dim)].green=(src[RIDX(0,dim-1,dim)].green+src[RIDX(1,dim-1,dim)].green+src[RIDX(0,dim-2,dim)].green+src[RIDX(1,dim-2,dim)].green)>>2;

    dst[RIDX(dim-1,0,dim)].red=(src[RIDX(dim-1,0,dim)].red+src[RIDX(dim-2,0,dim)].red+src[RIDX(dim-1,1,dim)].red+src[RIDX(dim-2,1,dim)].red)>>2;
    dst[RIDX(dim-1,0,dim)].blue=(src[RIDX(dim-1,0,dim)].blue+src[RIDX(dim-2,0,dim)].blue+src[RIDX(dim-1,1,dim)].blue+src[RIDX(dim-2,1,dim)].blue)>>2;
    dst[RIDX(dim-1,0,dim)].green=(src[RIDX(dim-1,0,dim)].green+src[RIDX(dim-2,0,dim)].green+src[RIDX(dim-1,1,dim)].green+src[RIDX(dim-2,1,dim)].green)>>2;

    dst[RIDX(dim-1,dim-1,dim)].red=(src[RIDX(dim-1,dim-1,dim)].red+src[RIDX(dim-1,dim-2,dim)].red+src[RIDX(dim-2,dim-1,dim)].red+src[RIDX(dim-2,dim-2,dim)].red)>>2;
    dst[RIDX(dim-1,dim-1,dim)].blue=(src[RIDX(dim-1,dim-1,dim)].blue+src[RIDX(dim-1,dim-2,dim)].blue+src[RIDX(dim-2,dim-1,dim)].blue+src[RIDX(dim-2,dim-2,dim)].blue)>>2;
    dst[RIDX(dim-1,dim-1,dim)].green=(src[RIDX(dim-1,dim-1,dim)].green+src[RIDX(dim-1,dim-2,dim)].green+src[RIDX(dim-2,dim-1,dim)].green+src[RIDX(dim-2,dim-2,dim)].green)>>2;

    // border
    for(i=1;i<dim-1;i++){
        int pos=i*dim;
        dst[pos].red=(src[pos].red+src[pos-dim].red+src[pos-dim+1].red+src[pos+1].red+src[pos+dim].red+src[pos+dim+1].red)/6;
        dst[pos].blue=(src[pos].blue+src[pos-dim].blue+src[pos-dim+1].blue+src[pos+1].blue+src[pos+dim].blue+src[pos+dim+1].blue)/6;
        dst[pos].green=(src[pos].green+src[pos-dim].green+src[pos-dim+1].green+src[pos+1].green+src[pos+dim].green+src[pos+dim+1].green)/6;
    }

    for(i=1;i<dim-1;i++){
        int pos=i*dim;
        dst[pos+dim-1].red=(src[pos+dim-1].red+src[pos-1].red+src[pos-2].red+src[pos-2+dim].red+src[pos+dim+dim-1].red+src[pos+dim+dim-2].red)/6;
        dst[pos+dim-1].blue=(src[pos+dim-1].blue+src[pos-1].blue+src[pos-2].blue+src[pos-2+dim].blue+src[pos+dim+dim-1].blue+src[pos+dim+dim-2].blue)/6;
        dst[pos+dim-1].green=(src[pos+dim-1].green+src[pos-1].green+src[pos-2].green+src[pos-2+dim].green+src[pos+dim+dim-1].green+src[pos+dim+dim-2].green)/6;
    }

    for(j=1;j<dim-1;j++){
        int pos=j;
        dst[pos].red=(src[pos].red+src[pos-1].red+src[RIDX(1,j-1,dim)].red+src[RIDX(1,j,dim)].red+src[RIDX(0,j+1,dim)].red+src[RIDX(1,j+1,dim)].red)/6;
        dst[pos].blue=(src[pos].blue+src[pos-1].blue+src[RIDX(1,j-1,dim)].blue+src[RIDX(1,j,dim)].blue+src[RIDX(0,j+1,dim)].blue+src[RIDX(1,j+1,dim)].blue)/6;
        dst[pos].green=(src[pos].green+src[pos-1].green+src[RIDX(1,j-1,dim)].green+src[RIDX(1,j,dim)].green+src[RIDX(0,j+1,dim)].green+src[RIDX(1,j+1,dim)].green)/6;
    }

    for(j=1;j<dim-1;j++){
        int pos=j+dim*(dim-1);
        dst[pos].red=(src[pos].red+src[pos+1].red+src[pos-1].red+src[pos-dim].red+src[pos-dim+1].red+src[pos-dim-1].red)/6;
        dst[pos].blue=(src[pos].blue+src[pos+1].blue+src[pos-1].blue+src[pos-dim].blue+src[pos-dim+1].blue+src[pos-dim-1].blue)/6;
        dst[pos].green=(src[pos].green+src[pos+1].green+src[pos-1].green+src[pos-dim].green+src[pos-dim+1].green+src[pos-dim-1].green)/6;
    }

    //common
    for(i=1;i<dim-1;i+=2)
        for(j=1;j<dim-1;j++){
            int pos=i*dim+j;
            dst[pos].red=(src[pos].red+src[pos+dim].red+src[pos-dim].red+src[pos-1].red+src[pos+dim-1].red+src[pos-dim-1].red+src[pos+1].red+src[pos+1+dim].red+src[pos+1-dim].red)/9;
            dst[pos].blue=(src[pos].blue+src[pos+dim].blue+src[pos-dim].blue+src[pos-1].blue+src[pos+dim-1].blue+src[pos-dim-1].blue+src[pos+1].blue+src[pos+1+dim].blue+src[pos+1-dim].blue)/9;
            dst[pos].green=(src[pos].green+src[pos+dim].green+src[pos-dim].green+src[pos-1].green+src[pos+dim-1].green+src[pos-dim-1].green+src[pos+1].green+src[pos+1+dim].green+src[pos+1-dim].green)/9;

            pos+=dim;

            dst[pos].red=(src[pos].red+src[pos+dim].red+src[pos-dim].red+src[pos-1].red+src[pos+dim-1].red+src[pos-dim-1].red+src[pos+1].red+src[pos+1+dim].red+src[pos+1-dim].red)/9;
            dst[pos].blue=(src[pos].blue+src[pos+dim].blue+src[pos-dim].blue+src[pos-1].blue+src[pos+dim-1].blue+src[pos-dim-1].blue+src[pos+1].blue+src[pos+1+dim].blue+src[pos+1-dim].blue)/9;
            dst[pos].green=(src[pos].green+src[pos+dim].green+src[pos-dim].green+src[pos-1].green+src[pos+dim-1].green+src[pos-dim-1].green+src[pos+1].green+src[pos+1+dim].green+src[pos+1-dim].green)/9;
        }
}

The test results are below. This round didn't improve things much, possibly because the degree of parallelism is low:

lab4-5