[INNODB] Structure and Space Allocation of the ibd File

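An ibd file is a complete tablespace file. Its basic unit is the page, which is 16KB by default. Pages come in several types, each serving a different purpose.

Before going through the page types, let's look at the two structures that group pages for management.

extent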

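An extent is the unit used to allocate pages. One extent consists of 64 contiguous pages, i.e. 1MB with 16KB pages. When a tablespace runs out of pages, it generally does not grow one page at a time but by a whole extent.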
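The arithmetic behind pages and extents can be sketched as follows. This is a minimal illustration assuming the default 16KB page size; the function names `extentOf` and `pageOffset` are made up for this example.

```go
package main

import "fmt"

const (
	pageSize       = 16 * 1024 // assumed default InnoDB page size
	pagesPerExtent = 64        // 64 * 16KB = 1MB per extent
)

// extentOf returns the index of the extent that contains pageNo.
func extentOf(pageNo uint32) uint32 { return pageNo / pagesPerExtent }

// pageOffset returns the byte offset of pageNo inside the .ibd file.
func pageOffset(pageNo uint32) int64 { return int64(pageNo) * pageSize }

func main() {
	for _, p := range []uint32{0, 3, 63, 64} {
		fmt.Printf("page %d: extent %d, file offset 0x%08X\n", p, extentOf(p), pageOffset(p))
	}
}
```

For page 3 this gives offset 0x0000C000, which is the offset the dump further down reports for that page.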

segment

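A segment manages extents. A table owns at least 2 segments: one manages the extents that hold non-leaf pages, the other manages the extents that hold leaf pages. Every additional index adds 2 more segments.

Now let's go through the common page types and what they are used for.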

Page

The page is the basic unit of an ibd file. Whatever its type, every page shares the same basic structure:

type Page struct {
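    // 38-byte file (FIL) header at the start of every page
    fheader FileHeader
    // page body; its layout depends on the page type stored in the header
    // 8-byte trailer: checksum followed by the low 32 bits of the LSN
    trailer [8]byte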
}

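Every page has a header and a trailer. The header records the page type, a checksum and other information; the trailer records a checksum and related data that are used to confirm the page was written completely. Based on the page type in the header, the page body can be parsed as the corresponding structure. The main types are described in the following sections.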
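As a rough sketch of how the header and trailer are used, the snippet below reads the page type and compares the LSN copy in the trailer with the one in the header. The offsets (page type at byte 24, LSN at bytes 16 to 23) and the type constants are taken from the InnoDB source as I remember it for the 5.x series; treat them as assumptions. Checksum validation is left out, and `demoPage` is a hypothetical helper that only builds test data.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const pageSize = 16 * 1024

// Page type values (assumed to match InnoDB's fil0fil.h).
const (
	pageTypeAllocated  = 0
	pageTypeInode      = 3
	pageTypeIbufBitmap = 5
	pageTypeFspHdr     = 8
	pageTypeXdes       = 9
	pageTypeIndex      = 17855
)

// pageType reads the 2-byte big-endian page type from the FIL header.
func pageType(page []byte) uint16 { return binary.BigEndian.Uint16(page[24:26]) }

// lsnMatches compares the low 32 bits of the header LSN with the copy
// stored in the last 4 bytes of the trailer. A mismatch suggests the
// page was only partially written.
func lsnMatches(page []byte) bool {
	head := binary.BigEndian.Uint32(page[20:24])
	tail := binary.BigEndian.Uint32(page[len(page)-4:])
	return head == tail
}

// demoPage builds a synthetic, otherwise empty page for illustration.
func demoPage(typ uint16, lsn uint64) []byte {
	page := make([]byte, pageSize)
	binary.BigEndian.PutUint64(page[16:], lsn)
	binary.BigEndian.PutUint16(page[24:], typ)
	binary.BigEndian.PutUint32(page[pageSize-4:], uint32(lsn))
	return page
}

func main() {
	page := demoPage(pageTypeIndex, 0x1234ABCD)
	fmt.Println("is index page:", pageType(page) == pageTypeIndex)
	fmt.Println("trailer LSN matches:", lsnMatches(page))
}
```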

file space header / xdes

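These two page types actually share one page structure. Such a page manages how extents are allocated and which segment each extent belongs to. Its extent-management part records how many extents have been allocated, which pages inside each extent are in use, and so on.

First, two generic reference structures. ListBaseNode is the head of a list: it points to the first and last elements and records the list length. ListNode is a node of a doubly linked list: it points to the previous and the next element.

The header of this page looks roughly like this: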
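A possible Go representation of these two structures is sketched below. It assumes the on-disk encoding used by InnoDB's file lists: a pointer is 6 bytes (4-byte page number plus 2-byte offset within that page), a ListBaseNode is a 4-byte length followed by two pointers, and a page number of 0xFFFFFFFF means "no element". The type name `FileAddress` and the parse helpers are hypothetical names for this example.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const filNull = 0xFFFFFFFF // page number meaning "no page"

// FileAddress is a 6-byte pointer: page number plus offset in that page.
type FileAddress struct {
	Page   uint32
	Offset uint16
}

// IsNull reports whether the address points nowhere.
func (a FileAddress) IsNull() bool { return a.Page == filNull }

// ListBaseNode is the head of a list (16 bytes on disk).
type ListBaseNode struct {
	Length uint32
	First  FileAddress
	Last   FileAddress
}

// ListNode links one element to its neighbours (12 bytes on disk).
type ListNode struct {
	Prev FileAddress
	Next FileAddress
}

func parseAddr(b []byte) FileAddress {
	return FileAddress{
		Page:   binary.BigEndian.Uint32(b[0:4]),
		Offset: binary.BigEndian.Uint16(b[4:6]),
	}
}

func parseListBaseNode(b []byte) ListBaseNode {
	return ListBaseNode{
		Length: binary.BigEndian.Uint32(b[0:4]),
		First:  parseAddr(b[4:10]),
		Last:   parseAddr(b[10:16]),
	}
}

func parseListNode(b []byte) ListNode {
	return ListNode{Prev: parseAddr(b[0:6]), Next: parseAddr(b[6:12])}
}

func main() {
	// A list with one element located at page 0, offset 0x0096.
	raw := []byte{0, 0, 0, 1, 0, 0, 0, 0, 0, 0x96, 0, 0, 0, 0, 0, 0x96}
	fmt.Printf("%+v\n", parseListBaseNode(raw))

	// A node with neither predecessor nor successor.
	lone := []byte{0xFF, 0xFF, 0xFF, 0xFF, 0, 0, 0xFF, 0xFF, 0xFF, 0xFF, 0, 0}
	n := parseListNode(lone)
	fmt.Println(n.Prev.IsNull(), n.Next.IsNull())
}
```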

type FSPHeader struct {
    // The space ID of the current space.
    spaceID uint32
    unused  uint32
    // The “size” is the highest valid page number, and is incremented
    // when the file is grown. However, not all of these pages are initialized
    // (some may be zero-filled), as extending a space is a multi-step process.
    highestPageNumberInFile uint32
    // The “free limit” is the highest page number for which the FIL header has
    // been initialized, storing the page number in the page itself, amongst other things.
    // The free limit will always be less than or equal to the size.
    highestPageNumberInitialized uint32
    // Storage of flags related to the space.
    flags uint32
    // Number of pages already in use within the extents on the FREE_FRAG list.
    pagesUsedInFreeFrag uint32
    // Extents that are completely unused and available to be allocated in
    // whole to some purpose. A FREE extent could be allocated to a file
    // segment (and placed on the appropriate INODE list), or moved
    // to the FREE_FRAG list for individual page use.
    freeList ListBaseNode
    // Extents with free pages remaining that are allocated to be used in “fragments”,
    // having individual pages allocated to different purposes rather than allocating
    // the entire extent. For example, every extent with an FSP_HDR or XDES page will be
    // placed on the FREE_FRAG list so that the remaining free pages in the extent can be
    // allocated for other uses.
    freeFragList ListBaseNode
    // Exactly like FREE_FRAG but for extents with no free pages remaining. Extents are
    // moved from FREE_FRAG to FULL_FRAG when they become full, and moved back to FREE_FRAG
    // if a page is released so that they are no longer full.
    fullFragList ListBaseNode
    // The file segment ID that will be used for the next allocated file segment.
    // (This is essentially an auto-increment integer.)
    nextUnusedSegmentID uint64
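    // INODE pages whose entries are all in use.
    fullInodesList ListBaseNode
    // INODE pages that still have at least one free entry.
    freeInodesList ListBaseNode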
}

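Only the first page (the file space header page, page 0) carries meaningful data in this structure; it manages extent allocation for the whole tablespace. The important fields are: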

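freeList: tracks all completely free extents
freeFragList: tracks the fragment extents, i.e. extents used page by page that still have free pages left
fullFragList: tracks the fragment extents that have no free pages left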

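All of the lists above track extents, and they matter a great deal for extent allocation. Extents whose pages are all free start out on freeList. From there an extent can go to one of two places: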

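To freeFragList, where it is used to hand out individual fragment pages instead of being given away as a whole. When an extent moves to freeFragList, its state changes accordingly. Once all pages of such a fragment extent are used, the extent moves on to fullFragList and its state is updated again.
To an inode. This is discussed later.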
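The state changes of a fragment extent can be illustrated with a small toy model. It is not InnoDB's allocator, only a sketch of the transitions described above; the numeric state values are assumed to match the InnoDB source (FREE=1, FREE_FRAG=2, FULL_FRAG=3, FSEG=4), and the `extent` type is invented for this example.

```go
package main

import "fmt"

// Extent states (values assumed from InnoDB's fsp0fsp.h).
const (
	stateFree     = 1
	stateFreeFrag = 2
	stateFullFrag = 3
	stateFseg     = 4
)

const pagesPerExtent = 64

// extent is a toy stand-in for an extent descriptor.
type extent struct {
	state int
	used  int // number of pages handed out so far
}

// allocFragPage hands out one fragment page and updates the state.
// It returns false if the extent cannot serve fragment pages.
func (e *extent) allocFragPage() bool {
	if e.state == stateFree {
		e.state = stateFreeFrag // freeList -> freeFragList
	}
	if e.state != stateFreeFrag {
		return false // already full, or owned by a segment
	}
	e.used++
	if e.used == pagesPerExtent {
		e.state = stateFullFrag // freeFragList -> fullFragList
	}
	return true
}

func main() {
	e := &extent{state: stateFree}
	e.allocFragPage()
	fmt.Println("after 1 page: state", e.state)
	for e.allocFragPage() {
	}
	fmt.Println("after", e.used, "pages: state", e.state)
}
```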

The FSP header is immediately followed by 256 XdesEntry structures:

// XdesEntry describes which pages within the extent are in use
type XdesEntry struct {
    // The ID of the file segment to which the extent belongs,
    // if it belongs to a file segment.
    fileSegmentID uint64
    // Pointers to previous and next extents in a doubly-linked extent descriptor list.
    // Two 6-byte pointers (page number + offset), 12 bytes in total.
    list ListNode
    // State: The current state of the extent, for which only four values are currently
    // defined: FREE, FREE_FRAG, and FULL_FRAG, meaning this extent belongs to the
    // space’s list with the same name; and FSEG, meaning this extent belongs to
    // the file segment with the ID stored in the File Segment ID field. (More on these lists below.)
    // Values in the InnoDB source: FREE=1, FREE_FRAG=2, FULL_FRAG=3, FSEG=4.
    state uint32
    // Page State Bitmap: A bitmap of 2 bits per page in the extent (64 x 2 = 128 bits, or 16 bytes).
    // The first bit indicates whether the page is free. The second bit is reserved to indicate whether
    // the page is clean (has no un-flushed data), but this bit is currently unused and is always set to 1.
    pageStateBitmap [16]byte
}

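The extent-management part of the page is an array of 256 of these structures, one per extent. Since one xdes page can describe only 256 extents, once the file grows beyond 256MB a new xdes page is allocated to describe the next 256 extents.

Each XdesEntry describes one extent. The important fields are: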
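Finding the descriptor for a given page is then simple arithmetic. The sketch below assumes 16KB pages, a 38-byte file header, a 112-byte FSP header and 40-byte XdesEntry structures (the sizes implied by the struct definitions above), and that a descriptor page sits at the start of every 16384-page range. The function name `xdesLocation` is made up for this example.

```go
package main

import "fmt"

const (
	pageSize       = 16384
	pagesPerExtent = 64
	filHeaderSize  = 38  // file header at the start of every page
	fspHeaderSize  = 112 // FSP header (only meaningful on page 0)
	xdesEntrySize  = 40  // 8 + 12 + 4 + 16 bytes
)

// xdesLocation returns the number of the FSP_HDR/XDES page that
// describes pageNo, and the byte offset of its XdesEntry in that page.
func xdesLocation(pageNo uint32) (xdesPage uint32, offset int) {
	// One descriptor page covers 16384 pages, i.e. 256 extents.
	xdesPage = pageNo / pageSize * pageSize
	entry := int(pageNo%pageSize) / pagesPerExtent
	return xdesPage, filHeaderSize + fspHeaderSize + entry*xdesEntrySize
}

func main() {
	for _, p := range []uint32{3, 64, 16384 + 130} {
		xp, off := xdesLocation(p)
		fmt.Printf("page %d: descriptor page %d, entry offset 0x%04X\n", p, xp, off)
	}
}
```

The first entry lands at offset 0x0096 (150), which is the offset printed next to extent 0 in the innoisp dump later in this note.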

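fileSegmentID: identifies the file segment inode this extent belongs to; inodes are introduced below
list: points to the previous and next extent in the list
state: the state of this extent
pageStateBitmap: records the usage of the 64 pages in this extent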
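Decoding pageStateBitmap could look like the sketch below. It assumes two bits per page with the "free" bit first, and that bits are counted from the least significant bit of each byte; the second assumption is my reading of the InnoDB source and should be checked against a real file. `pageFree`, `countFree` and `render` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strings"
)

const pagesPerExtent = 64

// pageFree reports whether page i (0..63) of the extent is free.
// Assumption: page i uses bits 2*i and 2*i+1, numbered from the least
// significant bit of each byte, and the first of the two is the free bit.
func pageFree(bitmap [16]byte, i int) bool {
	bit := i * 2
	return (bitmap[bit/8]>>uint(bit%8))&1 == 1
}

// countFree returns the number of free pages in the extent.
func countFree(bitmap [16]byte) int {
	n := 0
	for i := 0; i < pagesPerExtent; i++ {
		if pageFree(bitmap, i) {
			n++
		}
	}
	return n
}

// render prints the bitmap as a string of F (free) and N (not free).
func render(bitmap [16]byte) string {
	var sb strings.Builder
	for i := 0; i < pagesPerExtent; i++ {
		if pageFree(bitmap, i) {
			sb.WriteByte('F')
		} else {
			sb.WriteByte('N')
		}
	}
	return sb.String()
}

func main() {
	var bitmap [16]byte
	for i := range bitmap {
		bitmap[i] = 0xFF // every page free, clean bit set
	}
	bitmap[0] = 0xAA // clear the free bit of pages 0..3
	fmt.Printf("%s (%d free, %d used)\n", render(bitmap), countFree(bitmap), pagesPerExtent-countFree(bitmap))
}
```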

file segment inode

An inode manages a segment by grouping the extents that belong to it. Here is the structure of an inode page:

type INode struct {
    inodePageList ListNode
    inodes        [85]*INodeEntry
}

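inodePageList: links the inode pages together. More than one inode page is rarely needed: a new one is allocated only when all 85 entries of a page are used up, for example when a table has more than 42 indexes (2 segments each). The new inode page is then linked with the existing ones.
inodes: the information for each segment

Here is the definition of INodeEntry: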

type INodeEntry struct {
    // The ID of the file segment (FSEG) described by this
    // file segment INODE entry. If the ID is 0, the entry is unused.
    fileSegmentID uint64
    // Exactly like the space’s FREE_FRAG list (in the FSP header),
    // this field stores the number of pages used in the NOT_FULL list as an
    // optimization to be able to quickly calculate the number of free pages
    // in the list without iterating through all extents in the list.
    usedPagesInNotFullList uint32
    // Extents that are completely unused and are allocated to this file segment.
    freeList ListBaseNode
    // Extents with at least one used page allocated to this file
    // segment. When the last free page is used, the extent is moved to the FULL list.
    notFullList ListBaseNode
    // Extents with no free pages allocated to this file segment.
    // If a page becomes free, the extent is moved to the NOT_FULL list.
    fullList ListBaseNode
    // The value 97937874 is stored as a marker that this
    // file segment INODE entry has been properly initialized.
    magicNumber uint32
    // An array of 32 page numbers of pages allocated individually from
    // extents in the space’s FREE_FRAG or FULL_FRAG list of “fragment” extents.
    // Once this array becomes full, only full extents can be allocated to the file segment.
    fragmentArrayEntry [32]uint32
}

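fileSegmentID: the segment id of this inode
freeList: completely unused extents that have already been assigned to this segment
notFullList: extents that still have unused pages
fullList: extents with no free pages left
fragmentArrayEntry: the array of fragment pages

fragmentArrayEntry is the important one here because it ties into InnoDB's page allocation strategy. Right after a table is created, its file is 96KB, i.e. 6 pages. Besides the three initial pages, 3 more pages are free. These 6 pages belong to one extent, whose remaining 58 pages have not been allocated yet. This extent hangs on the freeFragList in the file space header and is used for fragment-page allocation.

When data is inserted into the new table, then going by the rule above (one whole extent per allocation) the file should grow to 1120KB (96KB + 1024KB). In practice the file is still 96KB. Let's look at it with innoisp: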

==========PAGE 0==========
page num 0, offset 0x00000000, page type <File space header> 

==========PAGE 1==========
page num 1, offset 0x00004000, page type <Insert Buffer bit map> 

==========PAGE 2==========
page num 2, offset 0x00008000, page type <File segment inode> 

==========PAGE 3==========
page num 3, offset 0x0000C000, page type <Index> level <0> 

==========PAGE 4==========
page num 4, offset 0x00010000, page type <Allocated> 

==========PAGE 5==========
page num 5, offset 0x00014000, page type <Allocated>

As we can see, no new extent was allocated. So where was page 3 allocated from? Let's use innoisp again to look at the inode information:

            ==========PAGE 2 OFFSET 0x8000==========
page list                                          
0xFFFFFFFF:0x0000 0xFFFFFFFF:0x0000                

file segment id     used(nf)  free list                                          not_full list                                      full list                                          fragment array
0x00000032:1        0         len<0> 0xFFFFFFFF:0x0000 0xFFFFFFFF:0x0000         len<0> 0xFFFFFFFF:0x0000 0xFFFFFFFF:0x0000         len<0> 0xFFFFFFFF:0x0000 0xFFFFFFFF:0x0000         3 (page allocate)
0x000000F2:2        0         len<0> 0xFFFFFFFF:0x0000 0xFFFFFFFF:0x0000         len<0> 0xFFFFFFFF:0x0000 0xFFFFFFFF:0x0000         len<0> 0xFFFFFFFF:0x0000 0xFFFFFFFF:0x0000         (page allocate)

This makes things clearer: the fragment array now holds a value, 3, meaning that page 3 was allocated from the freeFragList in the file space header. Let's verify:

            ==========PAGE 0 OFFSET 0x0000==========
space id  page allo  page init  flags   page used(fg)  free_frag list                                     free list                                          full_frag list                                     next segment id  full inodes                                        free inodes                                        
4399      6          64         0x0000  4              len<1> 0x00000000:0x0096 0x00000000:0x0096         len<0> 0xFFFFFFFF:0x0000 0xFFFFFFFF:0x0000         len<0> 0xFFFFFFFF:0x0000 0xFFFFFFFF:0x0000         3                len<0> 0xFFFFFFFF:0x0000 0xFFFFFFFF:0x0000         len<1> 0x00000002:0x0000 0x00000002:0x0000         

extend       page range          file segment id     state           page state (F)ree or (N)ot free
0(0x0096)    0-63                0x0000000000000000  0x00000002      NNNNFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF(60 free, 4 used)

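We can see that freeFragList points to the extent covering pages 0-63, and that 4 pages of this extent have been handed out. In other words, when a segment needs new space it first takes individual fragment pages from the fragment extents tracked by the file space header, up to 32 of them (the size of fragmentArrayEntry). Only after those 32 fragment pages are used up does it receive whole extents.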
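The policy can be summarized with a toy model: the first 32 pages of a segment are fragment pages, everything after that comes from whole extents. The real allocator applies further heuristics that are ignored here, and the `segment` type and its fields are invented for this sketch.

```go
package main

import "fmt"

const (
	fragSlots      = 32 // size of fragmentArrayEntry
	pagesPerExtent = 64
)

// segment is a toy stand-in for an INodeEntry.
type segment struct {
	fragPages     []uint32 // mirrors fragmentArrayEntry
	extents       int      // whole extents owned by the segment
	usedInExtents int      // pages used inside those extents
}

// allocPage records one page allocation and reports its source.
// fragPage is the page number to use if a fragment slot is still free.
func (s *segment) allocPage(fragPage uint32) string {
	if len(s.fragPages) < fragSlots {
		s.fragPages = append(s.fragPages, fragPage)
		return "fragment"
	}
	if s.usedInExtents == s.extents*pagesPerExtent {
		s.extents++ // all owned extents are full: take a whole new one
	}
	s.usedInExtents++
	return "extent"
}

func main() {
	s := &segment{}
	for i := 0; i < 100; i++ {
		s.allocPage(uint32(3 + i))
	}
	fmt.Printf("fragment pages: %d, whole extents: %d, pages used in extents: %d\n",
		len(s.fragPages), s.extents, s.usedInExtents)
}
```

In this model a segment with 100 pages ends up with 32 fragment pages and 2 extents, which explains why a small table does not grow by a full 1MB on the first insert.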

index page

Now for the data itself. All data in an ibd file is organized as B+ tree indexes. Let's first look at the page body of an index page:

type PageIndexHeader struct {
    nDirSlots    uint16
    heapTop      uint16
    nHeap        uint16
    free         uint16
    garbage      uint16
    lastInsert   uint16
    direction    uint16
    nDirection   uint16
    nRecs        uint16
    maxTrxID     uint64
    level        uint16
    indexID      uint64
    leafInode    FileSegmentHeader
    nonleafInode FileSegmentHeader
}

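nDirSlots: the number of slots in the page directory
level: the level of the page in the B+ tree, which tells leaf pages from non-leaf pages

The other fields are not covered in detail; these two are the important ones. In a B+ tree all data lives in the leaf nodes, so a lookup always walks from the root down to a leaf and never finishes at a non-leaf node. That is why lookup performance is fairly stable.

A level of 0 marks the index page as a leaf node, i.e. a node that carries data. A non-zero level marks a non-leaf node, which is either the root index page or an internal index page. Both contain only index entries.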
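Reading these two fields from a raw page could look like this. The offsets follow from the PageIndexHeader layout above (nDirSlots at byte 0, nRecs at byte 16, level at byte 26 of the index header) and assume the index header starts right after the 38-byte file header. The names `indexInfo`, `readIndexHeader` and `demoIndexPage` are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	pageSize      = 16 * 1024
	filHeaderSize = 38
)

// indexInfo holds the few index header fields used in this example.
type indexInfo struct {
	nDirSlots uint16
	nRecs     uint16
	level     uint16
}

// isLeaf reports whether the page is a leaf of the B+ tree.
func (i indexInfo) isLeaf() bool { return i.level == 0 }

// readIndexHeader extracts selected fields from an index page.
func readIndexHeader(page []byte) indexInfo {
	h := page[filHeaderSize:]
	return indexInfo{
		nDirSlots: binary.BigEndian.Uint16(h[0:2]),
		nRecs:     binary.BigEndian.Uint16(h[16:18]),
		level:     binary.BigEndian.Uint16(h[26:28]),
	}
}

// demoIndexPage builds a synthetic page with only these fields set.
func demoIndexPage(nDirSlots, nRecs, level uint16) []byte {
	page := make([]byte, pageSize)
	h := page[filHeaderSize:]
	binary.BigEndian.PutUint16(h[0:], nDirSlots)
	binary.BigEndian.PutUint16(h[16:], nRecs)
	binary.BigEndian.PutUint16(h[26:], level)
	return page
}

func main() {
	info := readIndexHeader(demoIndexPage(2, 0, 0))
	fmt.Printf("%+v leaf=%v\n", info, info.isLeaf())
}
```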

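Whether or not it holds any data, an index page always contains 2 system records: infimum and supremum. Infimum stands for the smallest record, supremum for the largest. All records in an index page are chained into a single ascending list through a singly linked list.

The page directory plays an important role when searching within a page. It is an array of slots, each pointing to a record. The page directory is never empty: it always contains 2 slots that point to infimum and supremum. To traverse a page we only need the physically last entry of the directory (the directory is stored in reverse order), which leads to the infimum record. Since infimum is the smallest record, its next record is the smallest record we actually inserted, and following the singly linked list visits every record.

The page directory is a fairly complex topic and will be covered in a separate note.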
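The traversal just described can be sketched on a synthetic page. Several details here are assumptions based on my understanding of the compact row format rather than on this note: slots are 2 bytes each and slot 0 sits directly before the 8-byte trailer; each record stores a 16-bit relative "next" offset in the 2 bytes before its origin; infimum and supremum sit at offsets 99 and 112. All function names are invented for this example.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	pageSize    = 16384
	trailerSize = 8
	slotSize    = 2
)

// slotPos returns the byte position of directory slot n.
// Slot 0 is the physically last entry, right before the trailer.
func slotPos(n int) int { return pageSize - trailerSize - (n+1)*slotSize }

// dirSlot returns the record offset stored in directory slot n.
func dirSlot(page []byte, n int) uint16 {
	return binary.BigEndian.Uint16(page[slotPos(n):])
}

// nextRecord follows the relative next pointer of the record at rec.
// uint16 arithmetic wraps around, which handles negative offsets.
func nextRecord(page []byte, rec uint16) uint16 {
	rel := binary.BigEndian.Uint16(page[rec-2:])
	return (rec + rel) % pageSize
}

// userRecords walks from infimum to supremum and collects the offsets
// of the user records in between.
func userRecords(page []byte, nDirSlots int) []uint16 {
	infimum := dirSlot(page, 0)
	supremum := dirSlot(page, nDirSlots-1)
	var recs []uint16
	// The length guard stops the loop on a corrupted chain.
	for rec := nextRecord(page, infimum); rec != supremum && len(recs) < pageSize; rec = nextRecord(page, rec) {
		recs = append(recs, rec)
	}
	return recs
}

// setNext stores the relative pointer from rec to next.
func setNext(page []byte, rec, next uint16) {
	binary.BigEndian.PutUint16(page[rec-2:], next-rec)
}

// demoPage builds a page holding infimum, the given user records in
// list order, and supremum, with a two-slot directory.
func demoPage(recs ...uint16) []byte {
	const infimum, supremum = 99, 112 // assumed compact-format offsets
	page := make([]byte, pageSize)
	binary.BigEndian.PutUint16(page[slotPos(0):], infimum)
	binary.BigEndian.PutUint16(page[slotPos(1):], supremum)
	prev := uint16(infimum)
	for _, r := range recs {
		setNext(page, prev, r)
		prev = r
	}
	setNext(page, prev, supremum)
	return page
}

func main() {
	page := demoPage(130, 160, 145)
	fmt.Println(userRecords(page, 2))
}
```

Note that the list order is the key order, not the physical order: in the example the record at offset 145 comes after the one at offset 160.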
