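An ibd file is a complete tablespace file. Its basic unit is the page, which is normally 16 KB. Pages come in several types, each serving a different purpose.
Before going through the page types, we first introduce the two groupings InnoDB uses to manage pages.
An extent is the unit in which pages are allocated. One extent is 64 contiguous pages, i.e. 1 MB. When the tablespace runs out of pages, InnoDB does not allocate new pages one at a time; it allocates a whole extent.
A segment manages extents. A table occupies at least 2 segments: one manages the extents that hold non-leaf pages, the other manages the extents that hold leaf pages. Every additional index adds 2 more segments.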
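The size arithmetic above can be sketched in a few lines of Go. The constant and function names are illustrative, and the sketch assumes the default 16 KB page size mentioned in the text.

```go
package main

import "fmt"

// Sizes taken from the text: 16 KB pages, 64 pages per extent.
const (
	PageSize       = 16 * 1024
	PagesPerExtent = 64
	ExtentSize     = PageSize * PagesPerExtent // 1 MB
)

// pageOffset returns the byte offset of a page inside the .ibd file.
func pageOffset(pageNo uint32) int64 { return int64(pageNo) * PageSize }

// extentOf returns the index of the extent that contains the page.
func extentOf(pageNo uint32) uint32 { return pageNo / PagesPerExtent }

func main() {
	fmt.Printf("extent size: %d bytes\n", ExtentSize)
	fmt.Printf("page 3 starts at 0x%08X in extent %d\n", pageOffset(3), extentOf(3))
}
```

The offset computed for page 3 is the same one the page dump later in this note reports for that page.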
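The rest of this note covers the common page types and what they are used for.
The page is the basic unit of an ibd file. Whatever its type, every page has the same basic structure: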
type Page struct {
// Start with file header
fheader FileHeader
// page body
// checksum && lsn
trailer [8]byte
}
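Every page has a header and a trailer. The header records the page type, a checksum, and other metadata; the trailer repeats checksum information so that a reader can confirm the page was written completely. The page type in the header tells us how to parse the page body. The main page types are covered below.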
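As a rough illustration of the header and trailer, here is a minimal Go sketch that decodes both. The note does not define FileHeader, so the field names, the 38-byte layout, and the page type numbers below are assumptions based on the commonly documented InnoDB format. The completeness check used here only compares the low 32 bits of the header LSN with the LSN copy in the trailer; a real check would also verify the checksum.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const pageSize = 16 * 1024 // assumed default page size

// FileHeader mirrors the 38-byte header at the start of a page.
// The layout is an assumption of this sketch, not taken from the note.
type FileHeader struct {
	Checksum uint32
	PageNo   uint32
	Prev     uint32
	Next     uint32
	LSN      uint64
	PageType uint16
	FlushLSN uint64
	SpaceID  uint32
}

// parseFileHeader decodes the big-endian header fields.
func parseFileHeader(page []byte) FileHeader {
	be := binary.BigEndian
	return FileHeader{
		Checksum: be.Uint32(page[0:]),
		PageNo:   be.Uint32(page[4:]),
		Prev:     be.Uint32(page[8:]),
		Next:     be.Uint32(page[12:]),
		LSN:      be.Uint64(page[16:]),
		PageType: be.Uint16(page[24:]),
		FlushLSN: be.Uint64(page[26:]),
		SpaceID:  be.Uint32(page[34:]),
	}
}

// pageTypeName maps a few page type codes (assumed values) to names.
func pageTypeName(t uint16) string {
	switch t {
	case 0:
		return "Allocated"
	case 3:
		return "File segment inode"
	case 5:
		return "Insert Buffer bit map"
	case 8:
		return "File space header"
	case 9:
		return "Extent descriptor"
	case 17855:
		return "Index"
	}
	return fmt.Sprintf("Unknown(%d)", t)
}

// trailerMatches reports whether the low 32 bits of the header LSN equal
// the LSN copy stored in the last 4 bytes of the 8-byte trailer.
func trailerMatches(page []byte) bool {
	h := parseFileHeader(page)
	return uint32(h.LSN) == binary.BigEndian.Uint32(page[len(page)-4:])
}

func main() {
	// Build a synthetic page rather than reading a real file.
	page := make([]byte, pageSize)
	binary.BigEndian.PutUint32(page[4:], 3)
	binary.BigEndian.PutUint64(page[16:], 0x1234)
	binary.BigEndian.PutUint16(page[24:], 17855)
	binary.BigEndian.PutUint32(page[pageSize-4:], 0x1234)
	h := parseFileHeader(page)
	fmt.Printf("page num %d, page type <%s>, complete=%v\n",
		h.PageNo, pageTypeName(h.PageType), trailerMatches(page))
}
```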
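The FSP_HDR and XDES page types actually share one page structure. This kind of page manages the allocation of extents and records which segment each extent belongs to. Its extent-management part records how many extents have been allocated, how the pages inside each extent are used, and so on.
First, two general-purpose reference structures. ListBaseNode points to the first and last elements of a list and stores the list length. ListNode is a node of a doubly linked list: it points to the previous and the next element.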
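The two list structures can be sketched as follows. The note does not give their definitions, so the field names and byte layout here are assumptions: a pointer is taken to be a 4-byte page number plus a 2-byte offset, and 0xFFFFFFFF is taken as the null page number, which agrees with the `0xFFFFFFFF:0x0000` pairs in the dumps later in this note.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const filNull = 0xFFFFFFFF // assumed "no page" marker

// FilAddr is a 6-byte pointer: page number plus byte offset in that page.
type FilAddr struct {
	Page   uint32
	Offset uint16
}

// IsNull reports whether the pointer refers to no page at all.
func (a FilAddr) IsNull() bool { return a.Page == filNull }

func (a FilAddr) String() string { return fmt.Sprintf("0x%08X:0x%04X", a.Page, a.Offset) }

// ListBaseNode is the 16-byte list head: length, first and last element.
type ListBaseNode struct {
	Length uint32
	First  FilAddr
	Last   FilAddr
}

// ListNode is the 12-byte link stored inside each list element.
type ListNode struct {
	Prev FilAddr
	Next FilAddr
}

func parseFilAddr(b []byte) FilAddr {
	return FilAddr{Page: binary.BigEndian.Uint32(b[0:]), Offset: binary.BigEndian.Uint16(b[4:])}
}

func parseListBaseNode(b []byte) ListBaseNode {
	return ListBaseNode{
		Length: binary.BigEndian.Uint32(b[0:]),
		First:  parseFilAddr(b[4:]),
		Last:   parseFilAddr(b[10:]),
	}
}

func parseListNode(b []byte) ListNode {
	return ListNode{Prev: parseFilAddr(b[0:]), Next: parseFilAddr(b[6:])}
}

func main() {
	// A list of length 1 whose only element lives at page 0, offset 0x96.
	raw := []byte{0, 0, 0, 1, 0, 0, 0, 0, 0, 0x96, 0, 0, 0, 0, 0, 0x96}
	base := parseListBaseNode(raw)
	fmt.Printf("len<%d> %s %s\n", base.Length, base.First, base.Last)
}
```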
The header of this page looks roughly like this:
type FSPHeader struct {
// The space ID of the current space.
spaceID uint32
unused uint32
// The “size” is the highest valid page number, and is incremented
// when the file is grown. However, not all of these pages are initialized
// (some may be zero-filled), as extending a space is a multi-step process.
highestPageNumberInFile uint32
// The “free limit” is the highest page number for which the FIL header has
// been initialized, storing the page number in the page itself, amongst other things.
// The free limit will always be less than or equal to the size.
highestPageNumberInitialized uint32
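// Storage of flags related to the space.
flags uint32
// Number of pages used in extents on the FREE_FRAG list, stored so that
// the number of free pages can be calculated without walking the list.
pagesUsedInFreeFrag uint32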
// Extents that are completely unused and available to be allocated in
// whole to some purpose. A FREE extent could be allocated to a file
// segment (and placed on the appropriate INODE list), or moved
// to the FREE_FRAG list for individual page use.
freeList ListBaseNode
// Extents with free pages remaining that are allocated to be used in “fragments”,
// having individual pages allocated to different purposes rather than allocating
// the entire extent. For example, every extent with an FSP_HDR or XDES page will be
// placed on the FREE_FRAG list so that the remaining free pages in the extent can be
// allocated for other uses.
freeFragList ListBaseNode
// Exactly like FREE_FRAG but for extents with no free pages remaining. Extents are
// moved from FREE_FRAG to FULL_FRAG when they become full, and moved back to FREE_FRAG
// if a page is released so that they are no longer full.
fullFragList ListBaseNode
// The file segment ID that will be used for the next allocated file segment.
// (This is essentially an auto-increment integer.)
nextUnusedSegmentID uint64
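// INODE pages in which every file segment INODE entry is in use.
fullInodesList ListBaseNode
// INODE pages that still have at least one unused INODE entry.
freeInodesList ListBaseNode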
}
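Only the copy of this structure in the first FSP page carries meaningful data; it manages the allocation of every extent in the tablespace. The list fields described above (freeList, freeFragList, fullFragList) all manage extents, and this information is essential when an extent is allocated. Extents whose pages are all free sit on freeList, and an extent taken from freeList goes to one of two places: a file segment, where it is placed on the appropriate inode list, or freeFragList, where its pages are handed out individually.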
Right after the header come 256 XdesEntry structures:
// XdesEntry describes which pages within the extent are in use
type XdesEntry struct {
// The ID of the file segment to which the extent belongs,
// if it belongs to a file segment.
fileSegmentID uint64
// Pointers to previous and next extents in a doubly-linked extent descriptor list.
// Each pointer is 6 bytes (4-byte page number + 2-byte offset), 12 bytes in total.
list ListNode
// State: The current state of the extent, for which only four values are currently
// defined: FREE, FREE_FRAG, and FULL_FRAG, meaning this extent belongs to the
// space’s list with the same name; and FSEG, meaning this extent belongs to
// the file segment with the ID stored in the File Segment ID field. (More on these lists below.)
// TODO: get the definition of state
state uint32
// Page State Bitmap: A bitmap of 2 bits per page in the extent (64 x 2 = 128 bits, or 16 bytes).
// The first bit indicates whether the page is free. The second bit is reserved to indicate whether
// the page is clean (has no un-flushed data), but this bit is currently unused and is always set to 1.
pageStateBitmap [16]byte
}
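The extent-management part of this page is an array of 256 such structures, one per extent. Because one XDES page can describe only 256 extents, a new XDES page is allocated once the file grows beyond 256 MB, and it describes the next 256 extents.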
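To make the 256-extent limit concrete, the sketch below computes which descriptor page covers a given page and where the matching XdesEntry sits inside it. The constants are assumptions derived from the structures in this note: a 38-byte file header followed by the 112-byte FSPHeader puts the array at offset 150 (0x96), and one XdesEntry is 8 + 12 + 4 + 16 = 40 bytes.

```go
package main

import "fmt"

const (
	pagesPerExtent     = 64
	extentsPerXdesPage = 256
	pagesPerXdesPage   = pagesPerExtent * extentsPerXdesPage // 16384 pages = 256 MB
	xdesArrayOffset    = 38 + 112                            // file header + FSPHeader (assumed sizes)
	xdesEntrySize      = 8 + 12 + 4 + 16                     // 40 bytes per XdesEntry
)

// xdesPageFor returns the number of the FSP_HDR/XDES page that describes pageNo.
func xdesPageFor(pageNo uint32) uint32 { return pageNo - pageNo%pagesPerXdesPage }

// xdesIndexFor returns the index of the extent's entry within that page.
func xdesIndexFor(pageNo uint32) uint32 { return (pageNo % pagesPerXdesPage) / pagesPerExtent }

// xdesEntryOffset returns the byte offset of the entry inside the descriptor page.
func xdesEntryOffset(pageNo uint32) uint32 {
	return xdesArrayOffset + xdesIndexFor(pageNo)*xdesEntrySize
}

func main() {
	for _, p := range []uint32{3, 64, 16384 + 70} {
		fmt.Printf("page %d -> descriptor page %d, entry %d, offset 0x%04X\n",
			p, xdesPageFor(p), xdesIndexFor(p), xdesEntryOffset(p))
	}
}
```

Under these assumptions the first entry lands at offset 0x0096, which is the offset shown next to extent 0 in the file space header dump later in this note.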
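An XdesEntry describes one extent. Its important fields are fileSegmentID (the owning segment, if any), list (the links to neighboring descriptors), state, and pageStateBitmap.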
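The state and bitmap fields can be decoded with a short sketch like the one below. Two things here are assumptions not stated in the note: the numeric state values (1 to 4), and the bit order of the bitmap (least significant bit first, with the free bit preceding the clean bit for each page).

```go
package main

import "fmt"

// XdesState is the extent state; the numeric values are assumed.
type XdesState uint32

const (
	XdesFree     XdesState = 1
	XdesFreeFrag XdesState = 2
	XdesFullFrag XdesState = 3
	XdesFseg     XdesState = 4
)

func (s XdesState) String() string {
	switch s {
	case XdesFree:
		return "FREE"
	case XdesFreeFrag:
		return "FREE_FRAG"
	case XdesFullFrag:
		return "FULL_FRAG"
	case XdesFseg:
		return "FSEG"
	}
	return fmt.Sprintf("UNKNOWN(%d)", uint32(s))
}

// pageIsFree tests the first of the two bits that belong to the page.
func pageIsFree(bitmap [16]byte, page int) bool {
	bit := page * 2
	return bitmap[bit/8]&(byte(1)<<uint(bit%8)) != 0
}

// describe renders one letter per page (F = free, N = not free)
// and counts the free pages.
func describe(bitmap [16]byte) (string, int) {
	out := make([]byte, 64)
	free := 0
	for p := 0; p < 64; p++ {
		if pageIsFree(bitmap, p) {
			out[p] = 'F'
			free++
		} else {
			out[p] = 'N'
		}
	}
	return string(out), free
}

func main() {
	// All pages free and "clean", then mark pages 0-3 as used:
	// free bit 0, clean bit 1 for four pages gives 0b10101010.
	var bm [16]byte
	for i := range bm {
		bm[i] = 0xFF
	}
	bm[0] = 0xAA
	s, free := describe(bm)
	fmt.Printf("%s %s(%d free, %d used)\n", XdesFreeFrag, s, free, 64-free)
}
```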
An inode manages a segment by grouping the extents that belong to it. First, the structure of an inode page:
type INode struct {
inodePageList ListNode
inodes [85]*INodeEntry
}
INodeEntry is defined as follows:
type INodeEntry struct {
// The ID of the file segment (FSEG) described by this
// file segment INODE entry. If the ID is 0, the entry is unused.
fileSegmentID uint64
// Exactly like the space’s FREE_FRAG list (in the FSP header),
// this field stores the number of pages used in the NOT_FULL list as an
// optimization to be able to quickly calculate the number of free pages
// in the list without iterating through all extents in the list.
usedPagesInNotFullList uint32
// Extents that are completely unused and are allocated to this file segment.
freeList ListBaseNode
// Extents with at least one used page allocated to this file
// segment. When the last free page is used, the extent is moved to the FULL list.
notFullList ListBaseNode
// Extents with no free pages allocated to this file segment.
// If a page becomes free, the extent is moved to the NOT_FULL list.
fullList ListBaseNode
// The value 97937874 is stored as a marker that this
// file segment INODE entry has been properly initialized.
magicNumber uint32
// An array of 32 page numbers of pages allocated individually from
// extents in the space’s FREE_FRAG or FULL_FRAG list of “fragment” extents.
// Once this array becomes full, only full extents can be allocated to the file segment.
fragmentArrayEntry [32]uint32
}
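The figure of 85 entries per inode page can be checked with a little arithmetic. This is a sketch under assumed sizes that the note does not state explicitly: a 38-byte file header, an 8-byte trailer, a 12-byte ListNode, and a 16-byte ListBaseNode.

```go
package main

import "fmt"

const (
	pageSize         = 16 * 1024
	filHeaderSize    = 38 // assumed
	filTrailerSize   = 8
	listNodeSize     = 12 // assumed: two 6-byte pointers
	listBaseNodeSize = 16 // assumed: 4-byte length + two 6-byte pointers
	fragSlots        = 32

	// fileSegmentID + usedPagesInNotFullList + three lists + magicNumber + fragment array
	inodeEntrySize = 8 + 4 + 3*listBaseNodeSize + 4 + fragSlots*4

	inodesPerPage = (pageSize - filHeaderSize - listNodeSize - filTrailerSize) / inodeEntrySize
)

// inodeEntryOffset returns the byte offset of entry i within an inode page.
func inodeEntryOffset(i int) int { return filHeaderSize + listNodeSize + i*inodeEntrySize }

func main() {
	fmt.Printf("entry size %d bytes, %d entries per page\n", inodeEntrySize, inodesPerPage)
	fmt.Printf("entry 0 at 0x%04X, entry 1 at 0x%04X\n", inodeEntryOffset(0), inodeEntryOffset(1))
}
```

The two offsets this yields, 0x32 and 0xF2, are the same values that appear in front of the segment IDs in the inode dump later in this note.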
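In INodeEntry, fragmentArrayEntry is the important field, because it is tied to InnoDB's page allocation strategy. A newly created table has a 96 KB file, i.e. 6 pages. Besides the three initial pages, three more pages are free. All six belong to one extent whose remaining 58 pages have not been allocated yet. That extent is linked into freeFragList in the file space header and is used for fragment page allocation.
When data is inserted into the new table, allocating one whole extent at a time as described above would grow the file to 1120 KB (96 KB + 1024 KB). In practice the file is still 96 KB. We can inspect it with innoisp: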
==========PAGE 0==========
page num 0, offset 0x00000000, page type <File space header>
==========PAGE 1==========
page num 1, offset 0x00004000, page type <Insert Buffer bit map>
==========PAGE 2==========
page num 2, offset 0x00008000, page type <File segment inode>
==========PAGE 3==========
page num 3, offset 0x0000C000, page type <Index> level <0>
==========PAGE 4==========
page num 4, offset 0x00010000, page type <Allocated>
==========PAGE 5==========
page num 5, offset 0x00014000, page type <Allocated>
No new extent has been allocated, so where did page 3 come from? We use innoisp again to look at the inode information:
==========PAGE 2 OFFSET 0x8000==========
page list
0xFFFFFFFF:0x0000 0xFFFFFFFF:0x0000
file segment id used(nf) free list not_full list full list fragment array
0x00000032:1 0 len<0> 0xFFFFFFFF:0x0000 0xFFFFFFFF:0x0000 len<0> 0xFFFFFFFF:0x0000 0xFFFFFFFF:0x0000 len<0> 0xFFFFFFFF:0x0000 0xFFFFFFFF:0x0000 3 (page allocate)
0x000000F2:2 0 len<0> 0xFFFFFFFF:0x0000 0xFFFFFFFF:0x0000 len<0> 0xFFFFFFFF:0x0000 0xFFFFFFFF:0x0000 len<0> 0xFFFFFFFF:0x0000 0xFFFFFFFF:0x0000 (page allocate)
This makes things clearer: the fragment array now holds a value, 3, which means page 3 was allocated from freeFragList in the file space header. We can verify this:
==========PAGE 0 OFFSET 0x0000==========
space id page allo page init flags page used(fg) free_frag list free list full_frag list next segment id full inodes free inodes
4399 6 64 0x0000 4 len<1> 0x00000000:0x0096 0x00000000:0x0096 len<0> 0xFFFFFFFF:0x0000 0xFFFFFFFF:0x0000 len<0> 0xFFFFFFFF:0x0000 0xFFFFFFFF:0x0000 3 len<0> 0xFFFFFFFF:0x0000 0xFFFFFFFF:0x0000 len<1> 0x00000002:0x0000 0x00000002:0x0000
extend page range file segment id state page state (F)ree or (N)ot free
0(0x0096) 0-63 0x0000000000000000 0x00000002 NNNNFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF(60 free, 4 used)
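freeFragList points to the extent covering pages 0-63, and 4 pages of that extent have been handed out. In other words, when a segment needs space it first takes up to 32 fragment pages through the file space header; only when those fragment slots are used up does it receive whole extents.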
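The policy described above can be modeled with a toy allocator. This is a deliberately simplified sketch, not InnoDB's actual code: the `segment` type and its `allocate` method are hypothetical names, and the model only captures the rule "32 fragment pages first, whole extents afterwards".

```go
package main

import "fmt"

const fragSlots = 32 // size of fragmentArrayEntry

// segment is a hypothetical, simplified model of one file segment.
type segment struct {
	frag    []uint32 // page numbers allocated individually (at most 32)
	extents int      // whole extents owned by the segment
}

// allocate models one request for space. While the fragment array has a
// free slot it takes the single page fragPage from a fragment extent and
// returns true; afterwards it takes a whole extent and returns false.
func (s *segment) allocate(fragPage uint32) bool {
	if len(s.frag) < fragSlots {
		s.frag = append(s.frag, fragPage)
		return true
	}
	s.extents++
	return false
}

func main() {
	var seg segment
	for i := 0; i < 34; i++ {
		fromFragment := seg.allocate(uint32(3 + i))
		if i == 0 || i >= 31 {
			fmt.Printf("request %d: fragment page=%v\n", i+1, fromFragment)
		}
	}
	fmt.Printf("fragment pages: %d, whole extents: %d\n", len(seg.frag), seg.extents)
}
```

A small table therefore never grows by a full megabyte: it stays within fragment pages until its segment has used all 32 slots.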
Now for the data itself. All data in an ibd file is organized as B+ tree indexes. First, the body of an index page starts with this header:
type PageIndexHeader struct {
nDirSlots uint16
heapTop uint16
nHeap uint16
free uint16
garbage uint16
lastInsert uint16
direction uint16
nDirection uint16
nRecs uint16
maxTrxID uint64
level uint16
indexID uint64
leafInode FileSegmentHeader
nonleafInode FileSegmentHeader
}
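We will not go through the remaining fields; the two mentioned above are the important ones. In a B+ tree all data lives in the leaf nodes, so a lookup always has to walk from the root down to a leaf and never finishes at a non-leaf node. This is why lookup performance is fairly stable.
A level of 0 marks the index page as a leaf node, i.e. a node that carries data. A non-zero level marks a non-leaf node, which is either the root index page or an internal index page. Both are pure index pages.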
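Reading the level out of a raw page could look like the sketch below. The field offsets follow from the order and sizes of the fields in PageIndexHeader; the 38-byte file header in front of it and the `IndexHeaderSummary` type are assumptions of this sketch.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const filHeaderSize = 38 // assumed size of the file header

// IndexHeaderSummary holds the few PageIndexHeader fields used here.
type IndexHeaderSummary struct {
	NDirSlots uint16
	NRecs     uint16
	Level     uint16
	IndexID   uint64
}

// parseIndexHeader reads selected fields of the index page header.
// Offsets: nine uint16 fields (18 bytes), then maxTrxID (8 bytes),
// which places level at 26 and indexID at 28.
func parseIndexHeader(page []byte) IndexHeaderSummary {
	h := page[filHeaderSize:]
	be := binary.BigEndian
	return IndexHeaderSummary{
		NDirSlots: be.Uint16(h[0:]),
		NRecs:     be.Uint16(h[16:]),
		Level:     be.Uint16(h[26:]),
		IndexID:   be.Uint64(h[28:]),
	}
}

// IsLeaf reports whether the page is a leaf, i.e. level 0.
func (s IndexHeaderSummary) IsLeaf() bool { return s.Level == 0 }

func main() {
	page := make([]byte, 16*1024) // synthetic, zero-filled page
	binary.BigEndian.PutUint16(page[filHeaderSize:], 2)
	s := parseIndexHeader(page)
	fmt.Printf("slots=%d recs=%d level=%d leaf=%v\n", s.NDirSlots, s.NRecs, s.Level, s.IsLeaf())
}
```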
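Every index page, with or without data, contains two system records: infimum and supremum. infimum stands for the smallest record and supremum for the largest. All records in an index page are chained together in ascending order by a singly linked list.
The page directory plays an important role when searching within a page. It is an array whose slots point to certain records. The page directory is never empty: it always has at least 2 slots, pointing to infimum and supremum. To traverse a page, we only need to read the last entry of the slot directory (the page directory is stored in reverse order) to find the infimum record. Since infimum stands for the smallest record, the record after it is the smallest record we actually inserted, and from there the singly linked list lets us visit every record.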
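The traversal just described can be sketched as below. Several details are assumptions, since this note leaves the record format for later: slots are taken to be 2 bytes each and to grow backwards from the 8-byte trailer, and the next-record pointer is taken to be a signed 16-bit offset relative to the current record, stored in the 2 bytes just before it (as in the compact row format). `buildDemoPage` is a hypothetical helper that fabricates a page with two user records.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	pageSize       = 16 * 1024
	filHeaderSize  = 38 // assumed
	filTrailerSize = 8
	slotSize       = 2 // assumed
)

// dirSlot returns the record offset stored in slot i.
// Slot 0 is physically the last one, right before the trailer.
func dirSlot(page []byte, i int) uint16 {
	pos := len(page) - filTrailerSize - slotSize*(i+1)
	return binary.BigEndian.Uint16(page[pos:])
}

// nextRecord follows the relative next pointer of the record at rec.
func nextRecord(page []byte, rec uint16) uint16 {
	rel := int16(binary.BigEndian.Uint16(page[rec-2:]))
	return uint16(int(rec) + int(rel))
}

// walk returns the offsets of all user records in ascending order.
func walk(page []byte) []uint16 {
	nSlots := int(binary.BigEndian.Uint16(page[filHeaderSize:]))
	infimum, supremum := dirSlot(page, 0), dirSlot(page, nSlots-1)
	var recs []uint16
	r := nextRecord(page, infimum)
	for i := 0; r != supremum && i < len(page); i++ { // bounded against corrupt chains
		recs = append(recs, r)
		r = nextRecord(page, r)
	}
	return recs
}

// buildDemoPage fabricates infimum(99) -> 130 -> 150 -> supremum(112).
func buildDemoPage() []byte {
	page := make([]byte, pageSize)
	link := func(from, to int) {
		binary.BigEndian.PutUint16(page[from-2:], uint16(int16(to-from)))
	}
	link(99, 130)
	link(130, 150)
	link(150, 112)
	binary.BigEndian.PutUint16(page[filHeaderSize:], 2)                     // nDirSlots
	binary.BigEndian.PutUint16(page[pageSize-filTrailerSize-slotSize:], 99) // slot 0: infimum
	binary.BigEndian.PutUint16(page[pageSize-filTrailerSize-2*slotSize:], 112)
	return page
}

func main() {
	fmt.Println("user records at offsets:", walk(buildDemoPage()))
}
```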
The page directory is fairly complicated, so it will be covered in a separate note.