RocksDB的读流程分析

相对于写流程，Get的流程还是相对而言比较简单的，我们这里简单的梳理一下Get经过哪几个步骤，是怎么样进行处理的。

获取当前时刻的SuperVersion。SuperVersion是RocksDB内针对于所有SST文件列表以及内存中的Memtable和Immutable memtable的一个版本
获取当前的序号来决定当前读操作依赖的数据快照
尝试从第一步SuperVersion中引用的Memtable以及Immutable memtable中获取对应的值。首先会经过布隆过滤器，假如不存在则一定不存在，反之假如返回存在则不一定存在
尝试从Block cache中读取
尝试从sst文件中获取

流程其实是比较清晰的，但是涉及到Memtable的查找，以及SST内数据的查找，相对而言还是有很多细节的。

Memtable的查找

首先我们来看一下，如何针对Memtable来进行查找，我们这里的Memtable的数据结构为Inline skiplist。查找的函数非常简单，如下：

  virtual void Get(const LookupKey& k, void* callback_args,
                   bool (*callback_func)(void* arg,
                                         const char* entry)) override {
    SkipListRep::Iterator iter(&skip_list_);
    Slice dummy_slice;
    for (iter.Seek(dummy_slice, k.memtable_key().data());
         iter.Valid() && callback_func(callback_args, iter.key());
         iter.Next()) {
    }
  }

通过迭代起来进行遍历，直到找到对应的key或者是遍历完所有数据。这里最终会调用到Inline skiplist的实现：

template <class Comparator>
inline void InlineSkipList<Comparator>::Iterator::Seek(const char* target) {
  node_ = list_->FindGreaterOrEqual(target);
}

template <class Comparator>
typename InlineSkipList<Comparator>::Node*
InlineSkipList<Comparator>::FindGreaterOrEqual(const char* key) const {
  // Note: It looks like we could reduce duplication by implementing
  // this function as FindLessThan(key)->Next(0), but we wouldn't be able
  // to exit early on equality and the result wouldn't even be correct.
  // A concurrent insert might occur after FindLessThan(key) but before
  // we get a chance to call Next(0).
  Node* x = head_;
  int level = GetMaxHeight() - 1;
  Node* last_bigger = nullptr;
  while (true) {
    Node* next = x->Next(level);
    if (next != nullptr) {
      PREFETCH(next->Next(level), 0, 1);
    }
    // Make sure the lists are sorted
    assert(x == head_ || next == nullptr || KeyIsAfterNode(next->Key(), x));
    // Make sure we haven't overshot during our search
    assert(x == head_ || KeyIsAfterNode(key, x));
    int cmp = (next == nullptr || next == last_bigger)
                  ? 1
                  : compare_(next->Key(), key);
    if (cmp == 0 || (cmp > 0 && level == 0)) {
      return next;
    } else if (cmp < 0) {
      // Keep searching in this list
      x = next;
    } else {
      // Switch to next list, reuse compare_() result
      last_bigger = next;
      level--;
    }
  }
}

这个查找算法思路很清晰，从最高的height开始循环遍历当前层数链表上的所有节点，我们首先定义以下几个变量：

x。表示当层链表的起始遍历节点，初始化的时候为头节点
last_bigger。表示上一层遍历的最大的节点

在进行节点比较的时候，需要根据上一层的计算结果来计算当前层的结果，比较清晰的一点是上一层的结果应当在一个(x, last_bigger)的比较区间内。下面分析一下比较节点的几种场景：

当该层链表的某个节点的值小于要查找的节点的时候，则保存在x节点中
若大于要查找的节点，则保存于last_bigger节点中，并将层数降低，进行下一层的遍历查找
若相等则直接返回当前节点
若当前节点与last_bigger相等，则说明已经到了右边界了，比较结果必定是last_bigger大于当前节点（根据上一层的计算结果），这个时候假设已经到了Level 0了，那么直接返回当前的节点，因为它满足大于等于比较节点；反之降低层数继续查找

经过几层的查找，每一层假设没有查找到，那么也会缩小查找范围，最终会返回一个最接近于并大于等于被查找节点的节点。

在找出该节点后，则会调用callback_func来回调查找结果，最终会落到rocksdb::SaveValue函数中。在该函数中，判断是否需要查找的目标key，假如不是则没有找到值；找到后则会经过internal key的解码操作获取到真实的值。

Immutable memtable和memtable是一致的，他们使用了相同的数据结构。

SST的查找

假设数据不在Memtable和Immutable memtable还有Block cache中的话，则会从磁盘中加载SST数据并进行查找。

首先我们得知晓两个概念。在Level0(L0)中的数据文件是可能会有数据重叠部分的，所以在L0中查找一个key，我们需要依次查找所有的SST文件；而对于非L0层级，每个文件都是按顺序存放的。知晓了这两个概念之后，我们再来看一个结构，名称为IndexUnit，用于借助上层(非L0)的该信息来缩小当前层的文件查找范围，我们先看下它的定义：

  struct IndexUnit {
    IndexUnit()
      : smallest_lb(0), largest_lb(0), smallest_rb(-1), largest_rb(-1) {}
    // During file search, a key is compared against smallest and largest
    // from a FileMetaData. It can have 3 possible outcomes:
    // (1) key is smaller than smallest, implying it is also smaller than
    //     larger. Precalculated index based on "smallest < smallest" can
    //     be used to provide right bound.
    // (2) key is in between smallest and largest.
    //     Precalculated index based on "smallest > greatest" can be used to
    //     provide left bound.
    //     Precalculated index based on "largest < smallest" can be used to
    //     provide right bound.
    // (3) key is larger than largest, implying it is also larger than smallest.
    //     Precalculated index based on "largest > largest" can be used to
    //     provide left bound.
    //
    // As a result, we will need to do:
    // Compare smallest (<=) and largest keys from upper level file with
    // smallest key from lower level to get a right bound.
    // Compare smallest (>=) and largest keys from upper level file with
    // largest key from lower level to get a left bound.
    //
    // Example:
    //    level 1:              [50 - 60]
    //    level 2:        [1 - 40], [45 - 55], [58 - 80]
    // A key 35, compared to be less than 50, 3rd file on level 2 can be
    // skipped according to rule (1). LB = 0, RB = 1.
    // A key 53, sits in the middle 50 and 60. 1st file on level 2 can be
    // skipped according to rule (2)-a, but the 3rd file cannot be skipped
    // because 60 is greater than 58. LB = 1, RB = 2.
    // A key 70, compared to be larger than 60. 1st and 2nd file can be skipped
    // according to rule (3). LB = 2, RB = 2.
    //
    // Point to a left most file in a lower level that may contain a key,
    // which compares greater than smallest of a FileMetaData (upper level)
    int32_t smallest_lb;
    // Point to a left most file in a lower level that may contain a key,
    // which compares greater than largest of a FileMetaData (upper level)
    int32_t largest_lb;
    // Point to a right most file in a lower level that may contain a key,
    // which compares smaller than smallest of a FileMetaData (upper level)
    int32_t smallest_rb;
    // Point to a right most file in a lower level that may contain a key,
    // which compares smaller than largest of a FileMetaData (upper level)
    int32_t largest_rb;
  };

上面的结构主要有四个索引值，它们都是指向了下一层的文件索引。分别代表的含义如下：

smallest_lb。从左向右的顺序第一个含有任意key大于上一层对应文件的smallest key
largest_lb。从左向右的顺序第一个含有任意key大于上一层对应文件的largest key
smallest_rb。从右向左的顺序第一个含有任意key小于上一层对应文件的smallest key
largest_rb。从右向左的顺序第一个含有任意key小于上一层对应文件的largest key

其中_lb是用来计算left boundary的，而_rb是用来计算right boundary的。

有了此IndexUnit的支持，我们来看一下如何针对磁盘上的SST进行搜索：

首先我们需要遍历搜索Level 0(L0)层所有的文件，假设没有找到则继续
我们来到了L1层，该层数据文件是有序的，每一个文件都会有一个FileMetaData保存着该文件的最小key和最大key，同时还有IndexUnit。我们通过对L1层文件进行二分查找来找到第一个largest key大于要查找key的文件。我们继续对此SST文件内的key进行查找，假如没有找到，则继续。理论上到L1层我们可以再次进行二分查找来找到对应的key，但是借助IndexUnit我们可以快速过滤不需要查找的文件，提升查找速度。这里还有一点，假设要查找的key小于第一个文件的smallest key，则基准文件为第一个文件；假设要查找的key大于最后一个文件的largest key，则基准文件为最后一个文件。针对于key与基准文件key范围的比较结果，我们又有以下的几种场景：
- 当前要查找的key小于基准文件的smallest key。这种场景下，假设基准文件为该层第一个文件则我们需要查找的文件范围为 [First file on L(N+1), smallest_rb of current sst];若不是第一个文件，则要查找的范围为[largest_lb of previous sst, smallest_rb of current sst]。
- 当前要查找的key在基准文件的smallest key和largest key范围内。这种场景下，我们要查找的文件范围为 [smallest_lb of current sst, largest_rb of current sst]。
- 当前要查找的key大于基准文件的largest key。这种场景下，唯一的可能性为基准文件为该层的最后一个文件，则我们需要查找的文件范围为[largest_lb of current sst, Last file on L(N+1)]
针对上层的范围，对下层的SST进行搜索

大致流程如上，但是如果要查找的key是某个SST的largest key，那么假如此文件没有找到，那么需要查找下一个文件，原因是某个key可能会有多个internal key，会分散到下一个文件中，RocksDB中也有如下的说明：

The reason that two files may share the same user key boundary is that RocksDB store InternalKey in files which consists of user key, sequence number of key type. So files may store multiple InternalKeys with the same user key.

RocksDB的读流程分析