索引 Indexing¶
- Form:search key - pointer
Query type¶
-
Point query: records with a specified value in the attribute
-
Range query: records with an attribute value falling in a specified range of values.
-
Primary index 主索引
-

-
candidate key唯一情况下用起来方便
-
Secondary Indices
-

- 不唯一,中间的指针指向多个目标
-
Dense index:Index record appears for every search-key value in the file.
-

-
Sparse index:contains index records for only some search-key values.
-

-

-
Multilevel Index(多级索引)
-
索引的索引

B+ Tree Index¶
- 每个节点都和block大小一样

- 叶子结点就是末级索引,之间用指针相连,使连续
- 高度:$ log_{n}(k)<= height <=log_{n/2}(k/2)+1 $,向上取整(根节点最小两叉)
- 大小:最大:半满,最小:全满
- 和ADS不同:每个节点容量相等,非叶子和叶子一样(ADS里非叶子最多阶数-1)
- 也就是说中间结点的分叉数比叶子结点值数多1
-
作用:叉数多,层数少,减少seek&block transfer次数
-
估计 height & size
-
person( pid char(18) primary key,
name char(8),
age smallint, address char(40));
Block size : 4K 1000000 persons
-
Records per block = 4096/(18+8+2+40) =60.235=60
- record 大小根据各个属性类型算
-
blocks for storing 1M persons= 1000000/60 =16667
- 计算block数
-
B+ tree n(fan-out) = (4096-4)/(18+4) +1 = 187
- B+树一个节点就是一个block,存放M个值和M+1个指针
即使是叶子结点,也多出一个指向下一个叶子的指针
指针比值多一个,所以先-4,后+1
-
一个索引项=索引值+指针(假设4byte)=18+4
-
最大187叉,最少 n/2 = 94叉
-
能索引多少值
-
2 levels: min=2
*93 = 186 max= 187'*186 = 34,782 -
3 levels: min=2
*94*93 = 17484 max=187*187*186 = 6,504,234 -
4 levels: min=2
*94*94*93 = 1,643,496 max=187
*187*187*186 = 1,216,291,758 -
易得这个B+树为3层
-
size:
- 最小(全满)100000/186+100000/186/187+1
- 最大(半满)100000/93+100000/93/94+1
Bottom-up B+ Tree Build¶

- fanout = 4,则阶数为3
- 先排序,按序分块,然后向上构建
- 构建上图B+树 cost:1 seek + 9 block transfer
- 插入大量值/合并树可以直接把叶子merge并排序,然后重建

- 原树的叶子读取:1 seek+6 block transfer
Indexing in Main Memory¶
- Random access in memory
- Much cheaper than on disk/flash, but still expensive compared to cache read
- Binary search for a key value within a large B+-tree node results in many cache misses
- Data structures that make best use of cache preferable – cache conscious
- Cache miss
- HD以block为单位读到buffer,buffer以64byte(例)为单位读到cache,大节点就会读不全,查找索引值过程中读取的cache只有小部分有用,没找到->产生miss
- 降低miss:
- 小节点:B+ trees with small nodes that fit in cache line are preferable to reduce cache misses
- 指针和search key分开排
- 建立一个“路标”(一棵小树)
- Key idea:
- use large node size to optimize disk access,
- but structure data within a node using a tree with small node size, instead of using an array, to optimize cache access.
LSM tree¶
- Log Structured Merge:写优化的树结构

- Benefits of LSM approach Inserts are done using only sequential I/O operations 随机写->顺序写 Leaves are full, avoiding space wastage Reduced number of I/O operations per record inserted as compared to normal B+-tree (up to some size)
-
Drawback of LSM approach Queries have to search multiple trees Entire content of each level copied multiple times
-
LSM-Stepped Merge Index
-

- 内存满了直接写到下一层来,disk中这一层满了再merge写到下一层
- 删除:插入删除标记
-
更新:删除+插入
-
Buffer Tree
-

随堂测试¶
-
1.Indexing mechanisms are used to speed up access to desired data.
判断题 (1 分)
A.Yes
B.No
正确答案: A
-
2.Range query returns records with an attribute value falling in a specified range of values.
判断题 (1 分)
A.Yes
B.No
正确答案: A
-
3.Secondary index is an index whose search key specifies an order same as the sequential order of the file.
判断题 (1 分)
A.Yes
B.NO
正确答案: B
-
4.In an dense index, index record appears for every search-key value in the file.
判断题 (1 分)
A.Yes
B.No
正确答案: A
-
5.In a B+-tree , all paths from root to leaf are of the same length.
判断题 (1 分)
A.Yes
B.No
正确答案: A
-
6.If the root of a B+-tree is not a leaf, it has at least 2 children.
判断题 (1 分)
A.Yes
B.No
正确答案: A
-
7.In databases, a node of a B+-tree is generally the same size as a disk block.
判断题 (1 分)
A.Yes
B.No
正确答案: A
-
8.The leaf nodes of a B+-tree file organization store records, instead of pointers to records.
判断题 (1 分)
A.Yes
B.No
正确答案: A
-
9.Benefits of LSM approach:
多选题 (1 分)
A.Inserts are done using only sequential I/O operations
B.Leave nodes are full, avoiding space wastage
C.Reduced number of I/O operations per record inserted as compared to normal B+-tree.
D.Queries have to search multiple trees
E.Entire content of each level copied multiple times
正确答案: A B C
-
10.Bitmap indices are useful for queries on multiple attributes,not particularly useful for single attribute queries.
判断题 (1 分)
A.Yes
B. No
正确答案: A