跳转至

索引 Indexing

  • Form:search key - pointer

Query type

  • Point query: records with a specified value in the attribute

  • Range query: records with an attribute value falling in a specified range of values.

  • Primary index 主索引

  • image-20230610160436673

  • candidate key唯一情况下用起来方便

  • Secondary Indices

  • image-20230610160508100

  • 不唯一,中间的指针指向多个目标
  • Dense index:Index record appears for every search-key value in the file.

  • image-20230610160725335

  • Sparse index:contains index records for only some search-key values.

  • image-20230610160742907

  • image-20230610160938171

  • Multilevel Index(多级索引)

  • 索引的索引

  • image-20230610161015946

B+ Tree Index

  • 每个节点都和block大小一样
  • image-20230610161325306
  • 叶子结点就是末级索引,之间用指针相连,使连续
  • 高度:$ log_{n}(k)<= height <=log_{n/2}(k/2)+1 $,向上取整(根节点最小两叉)
  • 大小:最大:半满,最小:全满
  • 和ADS不同:每个节点容量相等,非叶子和叶子一样(ADS里非叶子最多阶数-1)
    • 也就是说中间结点的分叉数比叶子结点值数多1
  • 作用:叉数多,层数少,减少seek&block transfer次数

  • 估计 height & size

  • person( pid char(18) primary key,

    ​ name char(8),

    ​ age smallint, ​ address char(40));

    Block size : 4K 1000000 persons

  • Records per block = 4096/(18+8+2+40) =60.235=60

    • record 大小根据各个属性类型算
  • blocks for storing 1M persons= 1000000/60 =16667

    • 计算block数
  • B+ tree n(fan-out) = (4096-4)/(18+4) +1 = 187

    • B+树一个节点就是一个block,存放M个值和M+1个指针

    即使是叶子结点,也多出一个指向下一个叶子的指针

    指针比值多一个,所以先-4,后+1

    • 一个索引项=索引值+指针(假设4byte)=18+4

    • 最大187叉,最少 n/2 = 94叉

    • 能索引多少值

    • 2 levels: min=2*93 = 186 max= 187'*186 = 34,782

    • 3 levels: min=2*94*93 = 17484 max=187*187*186 = 6,504,234

    • 4 levels: min=2*94*94*93 = 1,643,496

      ​ max=187*187*187*186 = 1,216,291,758

    • 易得这个B+树为3层

  • size:

    • 最小(全满)100000/186+100000/186/187+1
    • 最大(半满)100000/93+100000/93/94+1

Bottom-up B+ Tree Build

  • image-20230610171021418
  • fanout = 4,则阶数为3
  • 先排序,按序分块,然后向上构建
  • 构建上图B+树 cost:1 seek + 9 block transfer
  • 插入大量值/合并树可以直接把叶子merge并排序,然后重建
  • image-20230610171558671
  • 原树的叶子读取:1 seek+6 block transfer

Indexing in Main Memory

  • Random access in memory
    • Much cheaper than on disk/flash, but still expensive compared to cache read
    • Binary search for a key value within a large B+-tree node results in many cache misses
    • Data structures that make best use of cache preferable – cache conscious
  • Cache miss
    • HD以block为单位读到buffer,buffer以64byte(例)为单位读到cache,大节点就会读不全,查找索引值过程中读取的cache只有小部分有用,没找到->产生miss
    • 降低miss:
    • 小节点:B+ trees with small nodes that fit in cache line are preferable to reduce cache misses
    • 指针和search key分开排
    • 建立一个“路标”(一棵小树)
  • Key idea:
    • use large node size to optimize disk access,
    • but structure data within a node using a tree with small node size, instead of using an array, to optimize cache access.

LSM tree

  • Log Structured Merge:写优化的树结构
  • image-20230610175401820
  • Benefits of LSM approach Inserts are done using only sequential I/O operations 随机写->顺序写 Leaves are full, avoiding space wastage Reduced number of I/O operations per record inserted as compared to normal B+-tree (up to some size)
  • Drawback of LSM approach Queries have to search multiple trees Entire content of each level copied multiple times

  • LSM-Stepped Merge Index

  • image-20230610175540871

  • 内存满了直接写到下一层来,disk中这一层满了再merge写到下一层
  • 删除:插入删除标记
  • 更新:删除+插入

  • Buffer Tree

  • image-20230619143817724

随堂测试

  • 1.Indexing mechanisms are used to speed up access to desired data.

    判断题 (1 分)

    A.Yes

    B.No

    正确答案: A

  • 2.Range query returns records with an attribute value falling in a specified range of values.

    判断题 (1 分)

    A.Yes

    B.No

    正确答案: A

  • 3.Secondary index is an index whose search key specifies an order same as the sequential order of the file.

    判断题 (1 分)

    A.Yes

    B.NO

    正确答案: B

  • 4.In an dense index, index record appears for every search-key value in the file.

    判断题 (1 分)

    A.Yes

    B.No

    正确答案: A

  • 5.In a B+-tree , all paths from root to leaf are of the same length.

    判断题 (1 分)

    A.Yes

    B.No

    正确答案: A

  • 6.If the root of a B+-tree is not a leaf, it has at least 2 children.

    判断题 (1 分)

    A.Yes

    B.No

    正确答案: A

  • 7.In databases, a node of a B+-tree is generally the same size as a disk block.

    判断题 (1 分)

    A.Yes

    B.No

    正确答案: A

  • 8.The leaf nodes of a B+-tree file organization store records, instead of pointers to records.

    判断题 (1 分)

    A.Yes

    B.No

    正确答案: A

  • 9.Benefits of LSM approach:

    多选题 (1 分)

    A.Inserts are done using only sequential I/O operations

    B.Leave nodes are full, avoiding space wastage

    C.Reduced number of I/O operations per record inserted as compared to normal B+-tree.

    D.Queries have to search multiple trees

    E.Entire content of each level copied multiple times

    正确答案: A B C

  • 10.Bitmap indices are useful for queries on multiple attributes,not particularly useful for single attribute queries.

    判断题 (1 分)

    A.Yes

    B. No

    正确答案: A