相对于写流程,Get的流程还是相对而言比较简单的,我们这里简单的梳理一下Get经过哪几个步骤,是怎么样进行处理的。
流程其实是比较清晰的,但是涉及到Memtable的查找,以及SST内数据的查找,相对而言还是有很多细节的。
首先我们来看一下,如何针对Memtable来进行查找,我们这里的Memtable的数据结构为Inline skiplist。查找的函数非常简单,如下:
virtual void Get(const LookupKey& k, void* callback_args,
bool (*callback_func)(void* arg,
const char* entry)) override {
SkipListRep::Iterator iter(&skip_list_);
Slice dummy_slice;
for (iter.Seek(dummy_slice, k.memtable_key().data());
iter.Valid() && callback_func(callback_args, iter.key());
iter.Next()) {
}
}
通过迭代起来进行遍历,直到找到对应的key或者是遍历完所有数据。这里最终会调用到Inline skiplist的实现:
template <class Comparator>
inline void InlineSkipList<Comparator>::Iterator::Seek(const char* target) {
node_ = list_->FindGreaterOrEqual(target);
}
template <class Comparator>
typename InlineSkipList<Comparator>::Node*
InlineSkipList<Comparator>::FindGreaterOrEqual(const char* key) const {
// Note: It looks like we could reduce duplication by implementing
// this function as FindLessThan(key)->Next(0), but we wouldn't be able
// to exit early on equality and the result wouldn't even be correct.
// A concurrent insert might occur after FindLessThan(key) but before
// we get a chance to call Next(0).
Node* x = head_;
int level = GetMaxHeight() - 1;
Node* last_bigger = nullptr;
while (true) {
Node* next = x->Next(level);
if (next != nullptr) {
PREFETCH(next->Next(level), 0, 1);
}
// Make sure the lists are sorted
assert(x == head_ || next == nullptr || KeyIsAfterNode(next->Key(), x));
// Make sure we haven't overshot during our search
assert(x == head_ || KeyIsAfterNode(key, x));
int cmp = (next == nullptr || next == last_bigger)
? 1
: compare_(next->Key(), key);
if (cmp == 0 || (cmp > 0 && level == 0)) {
return next;
} else if (cmp < 0) {
// Keep searching in this list
x = next;
} else {
// Switch to next list, reuse compare_() result
last_bigger = next;
level--;
}
}
}
这个查找算法思路很清晰,从最高的height开始循环遍历当前层数链表上的所有节点,我们首先定义以下几个变量:
在进行节点比较的时候,需要根据上一层的计算结果来计算当前层的结果,比较清晰的一点是上一层的结果应当在一个(x, last_bigger)的比较区间内。下面分析一下比较节点的几种场景:
经过几层的查找,每一层假设没有查找到,那么也会缩小查找范围,最终会返回一个最接近于并大于等于被查找节点的节点。
在找出该节点后,则会调用callback_func来回调查找结果,最终会落到rocksdb::SaveValue
函数中。在该函数中,判断是否需要查找的目标key,假如不是则没有找到值;找到后则会经过internal key
的解码操作获取到真实的值。
Immutable memtable和memtable是一致的,他们使用了相同的数据结构。
假设数据不在Memtable和Immutable memtable还有Block cache中的话,则会从磁盘中加载SST数据并进行查找。
首先我们得知晓两个概念。在Level0(L0)中的数据文件是可能会有数据重叠部分的,所以在L0中查找一个key,我们需要依次查找所有的SST文件;而对于非L0层级,每个文件都是按顺序存放的。知晓了这两个概念之后,我们再来看一个结构,名称为IndexUnit
,用于借助上层(非L0)的该信息来缩小当前层的文件查找范围,我们先看下它的定义:
struct IndexUnit {
IndexUnit()
: smallest_lb(0), largest_lb(0), smallest_rb(-1), largest_rb(-1) {}
// During file search, a key is compared against smallest and largest
// from a FileMetaData. It can have 3 possible outcomes:
// (1) key is smaller than smallest, implying it is also smaller than
// larger. Precalculated index based on "smallest < smallest" can
// be used to provide right bound.
// (2) key is in between smallest and largest.
// Precalculated index based on "smallest > greatest" can be used to
// provide left bound.
// Precalculated index based on "largest < smallest" can be used to
// provide right bound.
// (3) key is larger than largest, implying it is also larger than smallest.
// Precalculated index based on "largest > largest" can be used to
// provide left bound.
//
// As a result, we will need to do:
// Compare smallest (<=) and largest keys from upper level file with
// smallest key from lower level to get a right bound.
// Compare smallest (>=) and largest keys from upper level file with
// largest key from lower level to get a left bound.
//
// Example:
// level 1: [50 - 60]
// level 2: [1 - 40], [45 - 55], [58 - 80]
// A key 35, compared to be less than 50, 3rd file on level 2 can be
// skipped according to rule (1). LB = 0, RB = 1.
// A key 53, sits in the middle 50 and 60. 1st file on level 2 can be
// skipped according to rule (2)-a, but the 3rd file cannot be skipped
// because 60 is greater than 58. LB = 1, RB = 2.
// A key 70, compared to be larger than 60. 1st and 2nd file can be skipped
// according to rule (3). LB = 2, RB = 2.
//
// Point to a left most file in a lower level that may contain a key,
// which compares greater than smallest of a FileMetaData (upper level)
int32_t smallest_lb;
// Point to a left most file in a lower level that may contain a key,
// which compares greater than largest of a FileMetaData (upper level)
int32_t largest_lb;
// Point to a right most file in a lower level that may contain a key,
// which compares smaller than smallest of a FileMetaData (upper level)
int32_t smallest_rb;
// Point to a right most file in a lower level that may contain a key,
// which compares smaller than largest of a FileMetaData (upper level)
int32_t largest_rb;
};
上面的结构主要有四个索引值,它们都是指向了下一层的文件索引。分别代表的含义如下:
其中_lb是用来计算left boundary的,而_rb是用来计算right boundary的。
有了此IndexUnit的支持,我们来看一下如何针对磁盘上的SST进行搜索:
我们来到了L1层,该层数据文件是有序的,每一个文件都会有一个FileMetaData保存着该文件的最小key和最大key,同时还有IndexUnit。我们通过对L1层文件进行二分查找来找到第一个largest key大于要查找key的文件。我们继续对此SST文件内的key进行查找,假如没有找到,则继续。理论上到L1层我们可以再次进行二分查找来找到对应的key,但是借助IndexUnit我们可以快速过滤不需要查找的文件,提升查找速度。这里还有一点,假设要查找的key小于第一个文件的smallest key,则基准文件为第一个文件;假设要查找的key大于最后一个文件的largest key,则基准文件为最后一个文件。针对于key与基准文件key范围的比较结果,我们又有以下的几种场景:
[First file on L(N+1), smallest_rb of current sst]
;若不是第一个文件,则要查找的范围为[largest_lb of previous sst, smallest_rb of current sst]
。[smallest_lb of current sst, largest_rb of current sst]
。[largest_lb of current sst, Last file on L(N+1)]
针对上层的范围,对下层的SST进行搜索
大致流程如上,但是如果要查找的key是某个SST的largest key,那么假如此文件没有找到,那么需要查找下一个文件,原因是某个key可能会有多个internal key,会分散到下一个文件中,RocksDB中也有如下的说明:
The reason that two files may share the same user key boundary is that RocksDB store InternalKey in files which consists of user key, sequence number of key type. So files may store multiple InternalKeys with the same user key.