date: 2021-12-05
tags: db
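While browsing Zhihu, I came across some notes on the paper "Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask" (Zhihu link). The paper compares vectorization and compilation as query-engine optimizations at the database level and analyzes the strengths and weaknesses of each.
While reading it, a natural question came up: why are these two optimizations at odds with each other? The paper puts it this way: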
Although both models eliminate the overhead of traditional engines and are highly efficient, they are conceptually different from each other: Vectorization is based on the pull model (root-to-leaf traversal), vector-at-a-time processing, and interpretation. Datacentric code generation uses the push model (leaf-to-root traversal), tuple-at-a-time processing, and up-front compilation
In other words, in the database world, one is pull based and the other is push based.
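To make the pull/push distinction concrete, here is a minimal sketch of a tiny scan -> filter -> sum pipeline written both ways. The names `ScanOp`, `FilterOp`, `sum_pull`, and `sum_push` are hypothetical, made up for this illustration and not taken from the paper, and the "keep even values" filter is just an arbitrary example predicate.

```cpp
#include <cstddef>
#include <vector>

// Pull model: the consumer (root) drives execution by calling next() on its child.
class ScanOp {
public:
    ScanOp(const std::vector<int>& data, std::size_t batch)
        : data_(data), batch_(batch) {}

    // Fills `out` with the next batch; returns false once the input is exhausted.
    bool next(std::vector<int>& out) {
        out.clear();
        while (pos_ < data_.size() && out.size() < batch_) {
            out.push_back(data_[pos_++]);
        }
        return !out.empty();
    }

private:
    const std::vector<int>& data_;
    std::size_t batch_;
    std::size_t pos_ = 0;
};

// Hypothetical filter that keeps even values only.
class FilterOp {
public:
    explicit FilterOp(ScanOp& child) : child_(child) {}

    // Pulls batches from the child until one has surviving values.
    bool next(std::vector<int>& out) {
        std::vector<int> in;
        while (child_.next(in)) {
            out.clear();
            for (int v : in) {
                if (v % 2 == 0) out.push_back(v);
            }
            if (!out.empty()) return true;
        }
        out.clear();
        return false;
    }

private:
    ScanOp& child_;
};

// Root of the pull pipeline: keeps asking the filter for batches.
long sum_pull(const std::vector<int>& data, std::size_t batch) {
    ScanOp scan(data, batch);
    FilterOp filter(scan);
    long total = 0;
    std::vector<int> chunk;
    while (filter.next(chunk)) {
        for (int v : chunk) total += v;
    }
    return total;
}

// Push model: the producer (leaf) drives execution and hands each tuple upward.
long sum_push(const std::vector<int>& data) {
    long total = 0;
    auto sink = [&](int v) { total += v; };
    auto filter = [&](int v) { if (v % 2 == 0) sink(v); };
    for (int v : data) filter(v);  // the scan pushes every tuple to its parent
    return total;
}
```

In `sum_pull` control flows from the root down to the leaf and data comes back one batch at a time, while in `sum_push` the scan loop sits at the bottom and feeds tuples upward one by one. A code-generating engine would emit the push pipeline as one specialized loop instead of going through lambdas; the lambdas here only stand in for that.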
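In pseudocode terms, the goal of vectorization is to turn a single loop into two nested loops, where the inner loop is simple enough to be parallelized automatically (for example, auto-vectorized into SIMD instructions by the compiler). That is, going from the original: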
for (int i = 0; i < num; i++) {
    output[i] = process(input[i]);
}
to:
for (int i = 0; i < num; i += batch) {
    for (int j = i; j < min(i + batch, num); j++) {
        output[j] = process(input[j]);
    }
}
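As a sanity check that the batched form computes the same thing as the flat loop, the two can be sketched as runnable C++. This is only an illustration: `run_scalar`, `run_batched`, and the body of `process` are assumptions of mine, not anything from the paper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-element operation standing in for process() above.
inline int process(int x) { return x * 2 + 1; }

// Tuple-at-a-time: one flat loop over the whole input.
std::vector<int> run_scalar(const std::vector<int>& input) {
    std::vector<int> output(input.size());
    for (std::size_t i = 0; i < input.size(); i++) {
        output[i] = process(input[i]);
    }
    return output;
}

// Vector-at-a-time: the outer loop walks over batches,
// the inner loop is a tight loop over a single batch.
std::vector<int> run_batched(const std::vector<int>& input, std::size_t batch) {
    assert(batch > 0);  // a zero batch size would never advance the outer loop
    const std::size_t num = input.size();
    std::vector<int> output(num);
    for (std::size_t i = 0; i < num; i += batch) {
        const std::size_t end = std::min(i + batch, num);  // last batch may be short
        for (std::size_t j = i; j < end; j++) {
            output[j] = process(input[j]);
        }
    }
    return output;
}
```

Hoisting the batch end into `end` keeps the inner loop bound fixed per batch, which is the kind of simple loop shape compilers tend to auto-vectorize. Whether that actually happens depends on the compiler and flags.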
Compilation, on the other hand, merges multiple loops into a single one, so that:
for (int i = 0; i < num; i++) {
    tmp1[i] = process1(input[i]);
}
for (int i = 0; i < num; i++) {
    tmp2[i] = process2(tmp1[i]);
}
for (int i = 0; i < num; i++) {
    output[i] = process3(tmp2[i]);
}
becomes:
for (int i = 0; i < num; i++) {
    // tmp1 and tmp2 are now per-iteration scalars rather than arrays
    tmp1 = process1(input[i]);
    tmp2 = process2(tmp1);
    output[i] = process3(tmp2);
}
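The same fusion can be written out as a small runnable C++ sketch. The stage bodies (`process1`, `process2`, `process3`) and the wrapper names `run_unfused` and `run_fused` are placeholders I chose for illustration; the point is only that the intermediate arrays disappear.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical pipeline stages standing in for process1/2/3 above.
inline int process1(int x) { return x + 1; }
inline int process2(int x) { return x * 3; }
inline int process3(int x) { return x - 2; }

// Unfused: every stage materializes its full intermediate result in memory.
std::vector<int> run_unfused(const std::vector<int>& input) {
    const std::size_t num = input.size();
    std::vector<int> tmp1(num), tmp2(num), output(num);
    for (std::size_t i = 0; i < num; i++) tmp1[i] = process1(input[i]);
    for (std::size_t i = 0; i < num; i++) tmp2[i] = process2(tmp1[i]);
    for (std::size_t i = 0; i < num; i++) output[i] = process3(tmp2[i]);
    return output;
}

// Fused: one pass over the input; intermediates are plain locals
// that the compiler can keep in registers.
std::vector<int> run_fused(const std::vector<int>& input) {
    const std::size_t num = input.size();
    std::vector<int> output(num);
    for (std::size_t i = 0; i < num; i++) {
        const int tmp1 = process1(input[i]);
        const int tmp2 = process2(tmp1);
        output[i] = process3(tmp2);
    }
    return output;
}
```

The fused version allocates no `tmp1`/`tmp2` arrays and reads each input element once, which is where the saving comes from: less memory traffic for intermediate results.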