1. Filesystem Notes (1)

1.1. File-related data structures

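Each process's task_struct holds a pointer, struct files_struct *files, to a structure describing every file and directory the process currently has open.
files_struct in turn embeds the array struct file *fd_array[]. Each slot points to a struct file, and the slot's index is the file descriptor (fd). A struct file records the basic state of one open file, such as the dentry (directory entry) it refers to
and the file_operations callbacks used to operate on it. The dentry leads to the file's inode, which holds the file's attributes.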

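The length of the fd array is stored in the max_fds field. Linux initially limits how many files a process can have open: the embedded array has NR_OPEN_DEFAULT slots, which is 32 on 32-bit machines and 64 on 64-bit machines.
The same holds when init forks a child process: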

do_fork======->copy_process
    |======->dup_task_struct======->alloc_task_struct
    |======->copy_files======->dup_fd======->alloc_files

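1.1.1. Expanding struct files_struct

When a process opens more than NR_OPEN_DEFAULT descriptors, the already initialized files_struct has to be expanded.
Expansion allocates and initializes a new struct fdtable. The contents of the old table (close_on_exec, open_fds and fd) are copied to their counterparts in the new table, and the fdt pointer in files_struct, which originally pointed at the embedded fdtab, is switched to the new table. At that point the files_struct is associated with two fdtable storage areas: its own embedded one and the newly allocated one that fdt points to.
After the copy the old struct fdtable is no longer needed, but it is not released on the spot: the embedded initial table is never freed, and a previously allocated table is handed to free_fdtable, which releases it later.
Expansion is implemented by expand_files in the kernel source, which calls expand_fdtable: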

static int expand_fdtable(struct files_struct *files, int nr)
{
    struct fdtable *new_fdt, *cur_fdt;
    new_fdt = alloc_fdtable(nr);      // allocate a new fdtable
    cur_fdt = files_fdtable(files);   // files->fdt
    if (nr >= cur_fdt->max_fds) {
        /* Continue as planned */
        copy_fdtable(new_fdt, cur_fdt);   // copies three members: fd, open_fds, close_on_exec
        rcu_assign_pointer(files->fdt, new_fdt);  // publish the new fdtable as files->fdt
        if (cur_fdt->max_fds > NR_OPEN_DEFAULT)  // the initial embedded table has max_fds == NR_OPEN_DEFAULT and is not freed
            free_fdtable(cur_fdt);
    }
    return 1;
}

After the expansion, the kernel also updates the max_fds field.
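How max_fds grows can be sketched in a few lines. This is a user-space model, assuming the rounding rule used by alloc_fdtable() in kernels of roughly this era (round up to a power-of-two number of 1 KiB chunks of pointers); the function names are hypothetical and the sysctl_nr_open cap is left out.

```python
def roundup_pow_of_two(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def new_fdtable_size(nr, pointer_size=8):
    """Capacity chosen when descriptor index `nr` no longer fits.

    Assumption: mirrors the rounding in alloc_fdtable(); pointer_size is
    sizeof(struct file *), 8 on 64-bit and 4 on 32-bit machines.
    """
    slots_per_chunk = 1024 // pointer_size        # pointers per 1 KiB chunk
    chunks = roundup_pow_of_two(nr // slots_per_chunk + 1)
    return chunks * slots_per_chunk
```

Under these assumptions a 64-bit process that needs descriptor 64 gets a table of 128 slots, then 256, then 512, which matches the doubling described in the notes.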

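(Figure: files_struct expansion)

The inode/dentry/superblock structures that a local filesystem (such as ext4) maintains on disk and the VFS structures of the same names kept in memory (the ones reachable from task_struct above) are two entirely different sets of things, despite the shared names.
file_operations vs. address_space_operations: f_op hooks the file into the VFS, while a_ops implements page cache access.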

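1.1.2. Standard input/output file descriptors

For every file in the fd array, the array index is its file descriptor. By convention, the first element (index 0) is the process's standard input, the second (index 1) its standard output, and the third (index 2) its standard error. Note that with the dup(), dup2() and fcntl() system calls, two file descriptors can refer to the same open file; that is, two array elements may point to the same file object. Users see this whenever they use a shell construct such as 2>&1 to redirect standard error to standard output.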
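The sharing described above can be observed from user space. A minimal sketch: the helper name is hypothetical, and it relies only on os.dup and os.lseek. Writing through the duplicate moves the file position seen through the original descriptor, because both array slots point at the same file object.

```python
import os
import tempfile


def shared_offset_after_dup():
    """Write via a dup()ed descriptor; return the offsets seen by both fds."""
    with tempfile.TemporaryFile() as tmp:
        fd = tmp.fileno()
        dup_fd = os.dup(fd)              # new fd-array slot, same open file
        try:
            os.write(dup_fd, b"hello")   # advances the shared file position
            return (os.lseek(fd, 0, os.SEEK_CUR),
                    os.lseek(dup_fd, 0, os.SEEK_CUR))
        finally:
            os.close(dup_fd)
```

Both offsets come back as 5, even though only one descriptor was written to.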

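1.1.3. rlimit: limiting the number of files a process can open

A process cannot use more than NR_OPEN (usually 1 048 576) file descriptors. The kernel also enforces a dynamic limit on the number of file descriptors through signal->rlim[RLIMIT_NOFILE] in the process descriptor; this value is usually 1024, but a process with superuser privileges can raise it.

Configuration in an Android init .rc file: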

# setrlimit <resource> <cur> <max>
# Linux distinguishes a soft limit from a hard limit. A process may change its own soft limit up to the hard limit; raising the hard limit requires root privileges.
setrlimit 7  4096  4096

SYSCALL_DEFINE2(setrlimit, unsigned int, resource, struct rlimit __user *, rlim)
{
	struct rlimit new_rlim;
	if (copy_from_user(&new_rlim, rlim, sizeof(*rlim)))
		return -EFAULT;
	return do_prlimit(current, resource, &new_rlim, NULL);
}
static inline unsigned long rlimit(unsigned int limit)
{
	return task_rlimit(current, limit);
}
static int alloc_fd(unsigned start, unsigned flags)
{
	return __alloc_fd(current->files, start, rlimit(RLIMIT_NOFILE), flags);
}
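The soft/hard pair that these kernel paths consult can be inspected from user space. A small sketch using Python's resource module; the helper names are hypothetical, and the setter changes only the soft limit, which an unprivileged process may do as long as it stays at or below the hard limit.

```python
import resource


def nofile_limits():
    """Return (soft, hard) RLIMIT_NOFILE for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def set_soft_nofile(new_soft):
    """Set the soft RLIMIT_NOFILE, leaving the hard limit untouched."""
    _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

The soft value is what rlimit(RLIMIT_NOFILE) passes to __alloc_fd in the excerpt above, so lowering it makes open fail earlier with EMFILE.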

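1.2. The open call path

open creates and fills in the structures described above for the specific target file being accessed.
mount creates the VFS dentry/inode structures for the filesystem's root directory.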

SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
{
	if (force_o_largefile())
		flags |= O_LARGEFILE;
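    // AT_FDCWD (-100) means that a relative filename is resolved starting from the current process's working directory.
    // The openat system call can instead supply a directory for the starting point; its descriptor then takes the place of AT_FDCWD.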
	return do_sys_open(AT_FDCWD, filename, flags, mode);
}
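The difference between AT_FDCWD and an explicit directory descriptor can be tried from user space. A sketch assuming a Linux system where os.open accepts dir_fd (which maps to openat); the helper name is hypothetical.

```python
import os
import tempfile


def read_relative_to_dir(dirpath, name):
    """Open `name` relative to a directory descriptor and return its bytes."""
    dfd = os.open(dirpath, os.O_RDONLY | os.O_DIRECTORY)
    try:
        # Equivalent to openat(dfd, name, O_RDONLY): the lookup starts at
        # dfd instead of the current working directory (AT_FDCWD).
        fd = os.open(name, os.O_RDONLY, dir_fd=dfd)
        try:
            return os.read(fd, 4096)
        finally:
            os.close(fd)
    finally:
        os.close(dfd)
```

Calling it with a directory and a bare file name reads the file without the process ever changing its working directory.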

+-COMPAT_SYSCALL_DEFINE3() <COMPAT_SYSCALL_DEFINE3 (open, const char __user *, filename, int, flags, umode_t, mode) at open.c:1135>
  \-do_sys_open() <long do_sys_open (int dfd, const char __user *filename, int flags, umode_t mode) at open.c:1085>
    +-build_open_flags() <inline int build_open_flags (int flags, umode_t mode, struct open_flags op) at open.c:961>  # build the open_flags
    +-getname()          # copy the path name from user space into kernel space:
    |                    # 1. allocate a buffer from the names_cachep slab cache with kmem_cache_alloc to hold the path name; the buffer is one 4KB page
    |                    # 2. if the path is longer than EMBEDDED_NAME_MAX, allocate additional memory with kzalloc
    |                    # 3. copy the path string from user space into the kernel
    +-get_unused_fd_flags() <int get_unused_fd_flags (unsigned flags) at file.c:543> # allocate an unused fd
    |     # If the process's array of struct file pointers can no longer hold the new descriptor,
    |     # e.g. when a 65th file is opened, a new memory area with room for 128 struct file pointers is allocated,
    |     # task_struct->files->fdt->fd is pointed at the start of that area, and the contents of the old array are copied into it.
    |     # When 128 is exceeded the next size is 256, then 512, and so on: always a power of two, with the old contents copied into the new area each time.
    | \+-__alloc_fd(current->files, 0, rlimit(RLIMIT_NOFILE), flags);
    |   \+- rlimit
    |     \- task_rlimit
    +-do_filp_open()     # walk the path component by component down to the target's inode in its filesystem, and initialize the file object
    |\ +- path_openat(&nd, op, flags)
    |  \ +- alloc_empty_file(op->open_flag, current_cred()) # allocate and initialize the struct file
    |    | \ +- __alloc_file(flags, cred)
    |    |    \ - kmem_cache_zalloc (filp_cachep, GFP_KERNEL) # this is where the slab allocator comes in
    |    +- path_init(nd,flags); # initialize nameidata for the path walk: determine the starting directory and set nd->path (the parent directory)
    |    +- link_path_walk(s, nd) # walk the path one component at a time, recording the result in nd; on a dcache miss, fall back to lookup_slow
    |    | \  +- lookup_fast      
    |    |    +- lookup_slow     # allocate a dentry from dentry_cache and initialize it, then search the local filesystem: its lookup method builds the inode and fills in the dentry
    |    |    |  \ +- "d_alloc_parallel(dir, name, &wq)"
    |    |    |    | \ +- kmem_cache_alloc(dentry_cache, GFP_KERNEL)   # allocate the dentry from dentry_cache
    |    |    |    +- inode->i_op->lookup(inode, dentry, flags)
                      \ +- ext4_lookup  # see the ext4_lookup walkthrough in 1.2.1
    |    +- do_last <nd, file, op, &opened> # create the file (if new) or fetch its inode, and fill in the file object
    |      \ +- vfs_open <&nd->path, file>
    |         \ +- do_dentry_open (file, d_backing_inode(path->dentry), NULL) # fill in the file from the inode, binding file and inode; also performs the permission checks for open
                  \ +- security_file_open(f) # SELinux permission check; SELinux hooks in through the LSM framework
                      \ - selinux_file_open # LSM_HOOK_INIT(file_open, selinux_file_open)
                  | +- f->f_op->open <inode>                  
                      \ +- ext4_file_open (inode, file * filp)
                          \ - fscrypt_file_open(inode, filp) # handle the encryption policy: check that the file's policy is consistent with its parent directory's
                          | +- dquot_file_open -- generic_file_open  # with the ext4 quota feature enabled, opening a file for write
                                                                     # initializes the quota pointers in the inode
                              \ -<f_mode & FMODE_WRITE>- __dquot_initialize <struct inode *inode, int type> 
         |- terminate_walk(nd) # end the walk and drop the RCU-walk state (RCU read-side sections must not block)
    +-fsnotify_open() # inotify mechanism: report the file-open event
    +-fd_install() <void fd_install (unsigned int fd, struct file file) at file.c:611>  
    | # install the file into the fd array: the new file object pointer is stored in the process's open-file array, fdt->fd[fd] = file
    \-putname() # release the path buffer allocated in the kernel by getname()

SYSCALL_DEFINE3(open...)	//open.c +1038
	do_sys_open()
		build_open_flags()
		struct filename *tmp = getname()
			getname_flags()
				kname = (char*)result + sizeof(*result)
				result->name = kname
				strncpy_from_user(kname, filename,max)
				result->uptr = filename
		fd = get_unused_fd_flags()
			__alloc_fd()
		struct file *f = do_filp_open()
			struct nameidata nd
			path_openat()
				file = get_empty_filp()
				file->f_flags = op->open_flag
				path_init()
					link_path_walk()
						may_lookup()
						walk_component()
							handle_dots()
							lookup_fast()	
							lookup_slow()	// look up in the actual on-disk filesystem
								__lookup_hash()
									lookup_dcache()
									lookup_real()
										dir->i_op->lookup()	//ext4_lookup
											inode = ext4_iget_normal()
												ext4_iget()
													struct ext4_inode * raw_inode
													struct inode *inode
													inode = iget_locked()
														inode = find_inode_fast()
														inode = alloc_inode()
														inode->i_ino = ino
														inode->i_state = I_NEW
														hlist_add_head()
														inode_sb_list_add()
													__ext4_get_inode_loc()
														struct buffer_head *bh
														struct ext4_group_desc *gdp
														ext4_inode_table()
														iloc->block_group = ...
														iloc->offset = ...
														get_bh(bh)
														bh->b_endio = end_buffer_read_sync
														submit_bh*()
															submit_bio()
														wait_on_buffer()
														iloc->bh = ...
													raw_inode = ext4_raw_inode()
													inode->i_blocks = ext4_inode_blocks()
													inode->i_size = ext4_isize()
													inode->i_op=...
													inode->i_fop= ...
							inode = path->dentry->d_inode
							nd->inode = inode
				do_last()
					handle_dots()
					lookup_fast()
					complete_walk()
						dentry->d_op->d_weak_revalidate()
					lookup_open()
						struct dentry *dir = nd->path.dentry
						struct inode *dir_inode = dir->d_inode
						lookup_dcache()
						atomic_open()
						lookup_real()
						vfs_create()
					audit_inode()
					mnt_want_write()
					may_open()
					vfs_open()
						struct inode *inode = path->dentry->d_inode
						inode->i_op->dentry_open()
						do_dentry_open()
							inode = f->f_inode = f->f_path.dentry->d_inode
							f->f_mapping = inode->i_mapping
							f->f_op=fops_get(inode->i_fop)
							open = f->f_op->open
							open(inode,f)
					terminate_walk()
		fsnotify_open()

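1.2.1. The ext4_lookup flow

Following the open path above, if the dentry is not in the dentry cache (dcache), the lookup drops into the specific filesystem to find the corresponding inode.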

+- dentry *ext4_lookup <struct inode *dir, struct dentry *dentry, unsigned int flags>  # dir is the parent directory's inode; dentry is the freshly allocated, still unpopulated dentry of the name being looked up
    \ +-  fscrypt_prepare_lookup(dir, dentry, flags) # set up the lookup for encrypted directories (file encryption)
    |  \ +- __fscrypt_prepare_lookup(struct inode *dir, struct dentry *dentry) 
    |       \- dentry->d_flags |= DCACHE_ENCRYPTED_WITH_KEY    # 
    |       |- d_set_d_op(dentry, &fscrypt_d_ops) # hook dentry_operations: .d_revalidate = fscrypt_d_revalidate
    | +-  ext4_find_entry(dir, &dentry->d_name, &de, NULL); # read the parent directory and find the entry for the name; the result (ext4_dir_entry_2, which holds the inode number) is returned in de
      |                       # directory entries: file names are stored in the directory. A directory is itself a file with its own inode; its data blocks store the names of all files in it together with their inode numbers.
    |  \ +- ext4_fname_setup_filename(dir, &dentry->d_name, 0, &fname);
    |    |  \ +-  fscrypt_setup_filename (dir, iname, lookup, &name) # when an encryption policy is set, the file name is encrypted/decrypted here
    |    |       \ +- fscrypt_get_encryption_info (struct inode *inode) # fetch the encryption context from the parent directory's inode
    |             |   \-  "inode->i_sb->s_cop->get_context(inode, &ctx, sizeof(ctx))"
    |             |  fname_encrypt <dir, iname, fname->crypto_buf.name, fname->crypto_buf.len> # the encrypted name is placed in "fname->crypto_buf.name"
    |             |  "fname->disk_name.name = fname->crypto_buf.name" # disk_name holds the encrypted file name
    |     | - ext4_has_inline_data <dir>  # very small data can be stored directly in the spare space inside the inode, with no separate data block
    |            \- ext4_find_inline_entry <dir, &fname, res_dir, &has_inline_data> # search the inline area of dir's inode for an entry with this name
    |     | - is_dx(dir) # ext4 dir_index feature [dir_index](https://bean-li.github.io/EXT4_DIR_INDEX/): the directory is organized as a hash tree; enabled by default in ext4
    |         \ - ext4_dx_find_entry(dir, &fname, res_dir)
    |     |+- ext4_bread_batch(dir, block, ra_max, false /* wait */, bh_use) # linear search; works through buffer_heads, Read a contiguous batch of blocks
    |         \ +- ext4_getblk(NULL, inode, block + i, 0)   # find or create the block for the given inode and return the buffer mapped to it; the create flag says whether to allocate a block when none is found
    |                                                        #[ext4 block management](https://blog.csdn.net/younger_china/article/details/22759543)
    |           | \ +- bh = sb_getblk(inode->i_sb, map.m_pblk)
    |               |  \ +- __getblk_slow <struct block_device *bdev, sector_t block, unsigned size, gfp_t gfp> # load the parent directory's block from disk
    |                     |  \ +- grow_dev_page  # create the page-cache page for the requested block
    |                          | \- find_or_create_page < inode->i_mapping, index, gfp_mask> #struct inode *inode = bdev->bd_inode
    |     | - buffer_uptodate(bh) # check that the buffer_head is up to date
    |     |+- search_dirblock(bh, dir, &fname,  block << EXT4_BLOCK_SIZE_BITS(sb), res_dir)         
    |       |  \ +- ext4_search_dir <>      # scan the directory's data block, comparing every name in it against the name being looked up;
    |                                       # on a match, return the ext4_dir_entry_2, which carries the file's inode number, name length, name and so on
    |            | \+- ext4_match <fname, de> # with encryption, matching is done on disk_name
    |               |- fscrypt_match_name <const struct fscrypt_name *fname, const u8 *de_name, u32 de_name_len>
    | +- inode = ext4_iget <dir->i_sb, ino, EXT4_IGET_NORMAL>  # using the inode number of the matching name found in the directory block, load the file's inode from disk
          \                                  # arguments are sb and ino (ino is unique within a given sb)  [ext4_iget analysis](https://blog.csdn.net/qq_32740107/article/details/93874383)
          | +- __ext4_iget <sb, ino, flags>
               \ +- inode = iget_locked <sb, ino>;   # first try the inode cache
               |                               # search inode_hashtable by inode number; on a hit return it, otherwise allocate an empty inode in inode_hashtable, mark it I_NEW and continue with the steps below
                   \ +- alloc_inode(sb)    # not found: allocate an inode
                      \ - inode = sb->s_op->alloc_inode(sb) # super_operations: ext4_alloc_inode(sb), kmem_cache_alloc(ext4_inode_cachep, GFP_NOFS), allocates from ext4_inode_cachep
                      | - inode_init_always(sb, inode) # initialize the inode
               |- __ext4_get_inode_loc(inode, &iloc, 0)  # read the on-disk inode data: get the buffer_head and the offset of the ext4_inode within it
               |- raw_inode = ext4_raw_inode(&iloc) # get the ext4_inode for the target; note the difference between the VFS inode and the local filesystem's inode
               |- ext4_set_aops(inode)... return inode # validate raw_inode and use it to set the inode's fields (i_mapping and so on), fill in the inode, and return it
    | +- d_splice_alias(inode, dentry)  # attach the inode to the dentry tree, binding dentry and inode
        \ +- __d_instantiate(struct dentry *dentry, struct inode *inode)
            \- hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry); # dentry to inode is many-to-one; link the dentry into the inode's i_dentry list
            |- __d_set_inode_and_type(dentry, inode, add_flags);  # dentry->d_inode = inode;

digraph G {
  rankdir=LR
  node [shape=record];
  
  ext4_dir_entry_2[ label="
      <0> ext4_dir_entry_2 |
      =======| <1> inode number |   <2>name_len|   <3> name [EXT4_NAME_LEN]|   file_type "]
  
  ext4_filename [label="
     <0> ext4_filename| =======| <1> qstr *usr_fname | <2>fscrypt_str disk_name |  dx_hash_info hinfo | fscrypt_str crypto_buf "]

  buffer_head [label="
     <0> buffer_head| =======| <1> char *b_data | block_device *b_bdev|  bh_end_io_t *b_end_io| b_count| ... "]

  buffer_head:1 -> ext4_dir_entry_2:0 [label="data block of the directory;\nall entries of the directory\nare stored in it (see the figure above)"]

    dentry [label="<f0> dentry|
      ======|
      <1> d_name|
      <f1> d_inode|
      <f2> d_sb|
      <f3> d_op|
      d_parent |
      d_child |
      d_subdirs |
      ..."]
      
   dentry:1 -> ext4_filename:1
   ext4_dir_entry_2:3 -> ext4_filename:2 [taillabel="compare" style=dotted color=blue]
   ext4_filename:1 -> ext4_filename:2 [label="encrypt" style=dotted]
}
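The linear scan that ext4_search_dir performs over one directory block can be sketched in user space. This is an illustration, not the kernel code: it assumes the classic ext4_dir_entry_2 on-disk layout (little-endian inode, rec_len, name_len, file_type, then the name; rec_len is omitted from the diagram above) and an unencrypted, non-casefolded name. Both helper names are hypothetical.

```python
import struct

# inode (u32), rec_len (u16), name_len (u8), file_type (u8), little-endian
DIRENT_HEADER = struct.Struct("<IHBB")


def make_entry(inode, name, file_type, rec_len):
    """Build one directory entry, zero-padded to rec_len bytes (demo helper)."""
    raw = DIRENT_HEADER.pack(inode, rec_len, len(name), file_type) + name
    return raw.ljust(rec_len, b"\0")


def search_dir_block(block, wanted):
    """Scan one directory block; return the inode number of `wanted` or None."""
    offset = 0
    while offset + DIRENT_HEADER.size <= len(block):
        inode, rec_len, name_len, _ftype = DIRENT_HEADER.unpack_from(block, offset)
        if rec_len < DIRENT_HEADER.size:
            break                      # corrupt entry: stop instead of looping
        start = offset + DIRENT_HEADER.size
        name = block[start:start + name_len]
        if inode != 0 and name == wanted:   # inode 0 marks an unused entry
            return inode
        offset += rec_len              # rec_len chains to the next entry
    return None
```

The inode number it returns is what ext4_iget is then called with in the tree above.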

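1.2.2. Slab allocations involved

The open call path above allocates from slab caches several times:

inode = sb->s_op->alloc_inode(sb) # ext4_alloc_inode(sb): kmem_cache_alloc(ext4_inode_cachep, GFP_NOFS), allocates from ext4_inode_cachep
kmem_cache_alloc(dentry_cache, GFP_KERNEL) # allocates a dentry from dentry_cache
getname() # allocates a buffer for the path name from the names_cachep cache; the buffer is one 4KB page
kmem_cache_zalloc (filp_cachep, GFP_KERNEL) # this is where the slab allocator comes in: allocates from filp_cachep

The inode cache and dentry cache belong to the VFS layer: they hold generic information common to all filesystems, independent of the on-disk format.

For these caches, which are reclaimable slabs, Linux provides dedicated slab shrinker functions.
Ultimately, all reclaimable memory has to be reclaimed through the LRU algorithm.
Memory that code allocates as reclaimable without providing a shrinker function cannot be reclaimed.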

#free pagecache:
  echo 1 > /proc/sys/vm/drop_caches
#free reclaimable slab objects(dentry_cache inode_cache)
  echo 2 > /proc/sys/vm/drop_caches
# free reclaimable slab objects and page cache
  echo 3 > /proc/sys/vm/drop_caches

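1.2.3. File access summary

When accessing a file, the important structures are struct file, the superblock (super_block), the inode, the directory entry (dentry) and the address_space.

1.2.3.1. Initialization of struct file

A struct file is initialized while the file is being opened.

Reads and writes find the struct file through the file descriptor fd and then operate through the file's methods. So when is the file created?
Normally open creates the struct file and binds it to an fd, so that later reads and writes can find the file from the fd. The file's methods are filled in by finish_open->do_dentry_open:
file.f_op = inode.i_fop
The most important step in file access is therefore the open: it initializes most of the resources, and read, write and the like simply use what open set up. That information is bound to the fd through struct file and thereby passed from open to read, write and the other file operations.

How do we get from a struct file to the file's super_block, inode and address_space?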

struct address_space *mapping = file->f_mapping; // set by do_dentry_open when the file is opened
struct inode *inode = mapping->host;
struct super_block *sb = mapping->host->i_sb;

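(Figure: finding the inode and super_block from struct file)

If the same file is opened twice, the two struct file objects have different addresses, but the inode and super_block are the same.
Within one mount, different files have different inodes but share the same super_block.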
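The first statement can be checked from user space. A minimal sketch with a hypothetical helper: two opens of one path yield two descriptors with independent file positions (two struct file objects), while fstat reports the same device and inode number for both.

```python
import os
import tempfile


def open_twice(path):
    """Open `path` twice and compare descriptors, inode identity and offsets."""
    fd1 = os.open(path, os.O_RDONLY)
    fd2 = os.open(path, os.O_RDONLY)
    try:
        os.read(fd1, 2)                      # moves only fd1's file position
        st1, st2 = os.fstat(fd1), os.fstat(fd2)
        return {
            "distinct_fds": fd1 != fd2,
            "same_inode": (st1.st_dev, st1.st_ino) == (st2.st_dev, st2.st_ino),
            "offsets": (os.lseek(fd1, 0, os.SEEK_CUR),
                        os.lseek(fd2, 0, os.SEEK_CUR)),
        }
    finally:
        os.close(fd1)
        os.close(fd2)
```

Contrast this with dup(): a duplicated descriptor shares the file position, whereas a second open gets its own.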

1.3. EXT4 Extents

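(Figure: data block indexing in ext2/ext3)

1.3.1. Debugging

Extract system.img from the fastboot flashing package, use simg2img to convert it from the Android sparse format into a raw ext4 image, and mount it on a PC.

# use stat to find the inode number of a file under system/etc/permissions
stat privapp-permissions-miui.xml
Device: 700h/1792d	Inode: 1653        Links: 1
# use istat to list the data blocks of that inode
 istat system.img_ext4 1653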
Direct Blocks:
255045 255046 255047 255048 255049 255050 255051 255052 
255053 255054

Without istat, the data block index can be read directly from the raw block data.
First look at the extent structures inside the inode, in the Linux kernel source.
The definitions below are from the Android 4.19 kernel:

struct ext4_inode {
	__le16	i_mode;		/* File mode */     2
	__le16	i_uid;		/* Low 16 bits of Owner Uid */ 2
	__le32	i_size_lo;	/* Size in bytes */ 4
	__le32	i_atime;	/* Access time */ 4
	__le32	i_ctime;	/* Inode Change time */ 4
	__le32	i_mtime;	/* Modification time */ 4
	__le32	i_dtime;	/* Deletion Time */ 4
	__le16	i_gid;		/* Low 16 bits of Group Id */ 2
	__le16	i_links_count;	/* Links count */ 2
	__le32	i_blocks_lo;	/* Blocks count */ 4
	__le32	i_flags;	/* File flags */ 4
	union {
		struct {
			__le32  l_i_version;
		} linux1;
		struct {
			__u32  h_i_translator;
		} hurd1;
		struct {
			__u32  m_i_reserved1;
		} masix1;
	} osd1;				/* OS dependent 1 */ 4
	__le32	i_block[EXT4_N_BLOCKS];/* Pointers to blocks */ EXT4_N_BLOCKS=15
   ...
};
struct ext4_extent {
	__le32	ee_block;	/* first logical block extent covers */ 4
	__le16	ee_len;		/* number of blocks covered by extent */ 2
	__le16	ee_start_hi;	/* high 16 bits of physical block */ 2
	__le32	ee_start_lo;	/* low 32 bits of physical block */ 4
};
struct ext4_extent_header {
	__le16	eh_magic;	/* probably will support different formats */ 2
	__le16	eh_entries;	/* number of valid entries */ 2
	__le16	eh_max;		/* capacity of store in entries */ 2
	__le16	eh_depth;	/* has tree real underlying blocks? */ 2
	__le32	eh_generation;	/* generation of the tree */ 4
};
struct ext4_extent_idx {
	__le32	ei_block;	/* index covers logical blocks from 'block' */ 4
	__le32	ei_leaf_lo;	/* pointer to the physical block of the next */ 4
				 /* level. leaf or next index could be there */ 
	__le16	ei_leaf_hi;	/* high 16 bits of physical block */ 2
	__u16	ei_unused; 2
};

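Reading the extents when all blocks are direct blocks:
On the device, the extent data starts at byte offset 40 (0x28) of the inode structure, in i_block, which occupies 4*15 = 60 bytes.

To dump inode 1653, first check the filesystem's Inodes per group, Inode blocks per group and Inode size: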

dumpe2fs system.img_ext4
Inode size:	          256
Inodes per group:       8064
Inode blocks per group:   504

Inode 1653 lies in group 0, since (1653 - 1) / 8064 = 0, so we look at the description of group 0:

Group 0: (Blocks 0-32767) [ITABLE_ZEROED]
  Checksum 0x0760, unused inodes 2069
  Primary superblock at 0, Group descriptors at 1-1
  Reserved GDT blocks at 2-221
  Block bitmap at 222 (+222), Inode bitmap at 223 (+223)
  Inode table at 224-727 (+224)
  0 free blocks, 2069 free inodes, 877 directories, 2069 unused inodes
  Free blocks: 
  Free inodes: 5996-8064

Note the inode table location: Inode table at 224-727 (+224). A 4096-byte block holds 4096/256 = 16 inodes, so inode 1653 is in block 224 + (1653 - 1)/16 = 327, in slot (1653 - 1) % 16 = 4 (counting from 0) of that block.
So we need to dump the fifth 256-byte slot of block 327.
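The arithmetic above generalizes to any inode. A small sketch, with a hypothetical function name; the caller supplies the values printed by dumpe2fs, including the first inode-table block of the group the inode belongs to.

```python
def locate_inode(ino, inodes_per_group, inode_size, block_size, inode_table_start):
    """Return (group, block, byte offset in block) of an on-disk inode.

    inode_table_start must be the first inode-table block of the inode's
    own group, as listed by dumpe2fs. Inode numbers start at 1.
    """
    group = (ino - 1) // inodes_per_group
    index = (ino - 1) % inodes_per_group          # index within the group
    inodes_per_block = block_size // inode_size
    block = inode_table_start + index // inodes_per_block
    offset = (index % inodes_per_block) * inode_size
    return group, block, offset
```

For the values above, locate_inode(1653, 8064, 256, 4096, 224) gives group 0, block 327 and offset 0x400, the slot dumped next.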

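blkcat system.img_ext4 327 > temp.log
xxd temp.log

Alternatively, open it with vi -b temp.log and run :%!xxd, or use hexdump -C -v temp.log.

Never open it with plain vi temp.log (without -b) and then run :%!xxd; the data gets mangled.

The fifth 256-byte slot spans 0x400-0x4ff: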

00000400  a4 81 00 00 56 9a 00 00  80 07 5c 49 80 07 5c 49  |....V.....\I..\I|
00000410  80 07 5c 49 00 00 00 00  00 00 01 00 50 00 00 00  |..\I........P...|
00000420  00 00 08 00 00 00 00 00  0a f3 01 00 04 00 00 00  |................|
00000430  00 00 00 00 00 00 00 00  0a 00 00 00 45 e4 03 00  |............E...|
00000440  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000450  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000460  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000470  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000480  20 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  | ...............|
00000490  80 07 5c 49 00 00 00 00  00 00 00 00 00 00 00 00  |..\I............|
000004a0  00 00 02 ea 07 06 40 00  00 00 00 00 1a 00 00 00  |......@.........|
000004b0  00 00 00 00 73 65 6c 69  6e 75 78 00 00 00 00 00  |....selinux.....|
000004c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000004d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000004e0  00 00 00 00 75 3a 6f 62  6a 65 63 74 5f 72 3a 73  |....u:object_r:s|
000004f0  79 73 74 65 6d 5f 66 69  6c 65 3a 73 30 00 00 00  |ystem_file:s0...|

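In the inode structure, offset 0x04 holds the file size, which gives a first check that this is the right file.

The device is little-endian.

0x9a56 is 39510, exactly the file size reported by stat, so this is the right inode.
Offsets 0x28-0x63 hold the extent data. It starts with a 12-byte ext4_extent_header, followed by 12-byte entries:

00000420  00 00 08 00 00 00 00 00 || 0a f3 01 00 04 00 00 00  |................|
00000430  00 00 00 00 || 00 00 00 00 0a 00 00 00 45 e4 03 00  |............E...|

0xf30a is the magic number of ext4_extent_header.
Following the ext4_extent_header layout: offset 0x2a holds the number of valid entries, 1 here; offset 0x2c holds the maximum number of entries, 4 here; offset 0x2e holds the depth, 0 here, meaning there is no extent tree below and the entries map data blocks directly.
Offsets 0x34-0x3f hold the extent, 12 bytes (the size of ext4_extent). Reading it field by field (the || above marks the boundaries):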

struct ext4_extent {
	__le32	ee_block;	/* first logical block extent covers */ 4     0
	__le16	ee_len;		/* number of blocks covered by extent */ 2    10
	__le16	ee_start_hi;	/* high 16 bits of physical block */ 2    0
	__le32	ee_start_lo;	/* low 32 bits of physical block */ 4     0x03e445
};

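The high 16 bits are 0 and the low 32 bits are 0x03e445; together the two fields can express a 48-bit physical block number.

0x03e445 = 255045, which matches the istat output. ee_len is 10, so the 10 consecutive blocks starting at 255045 (255045-255054) are the data blocks of this inode.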
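The manual decoding above can be reproduced with a few lines of struct unpacking. A sketch for the depth-0 case only, with hypothetical names; it assumes the little-endian layouts of ext4_extent_header and ext4_extent listed earlier and takes the 60 bytes of i_block as input.

```python
import struct

EXT4_EXT_MAGIC = 0xF30A
HEADER = struct.Struct("<HHHHI")   # eh_magic, eh_entries, eh_max, eh_depth, eh_generation
EXTENT = struct.Struct("<IHHI")    # ee_block, ee_len, ee_start_hi, ee_start_lo


def parse_inline_extents(i_block):
    """Decode the extents stored directly in i_block (eh_depth == 0).

    Returns a list of (logical block, length, physical start block).
    """
    magic, entries, _max, depth, _gen = HEADER.unpack_from(i_block, 0)
    if magic != EXT4_EXT_MAGIC:
        raise ValueError("not an extent header")
    if depth != 0:
        raise ValueError("index node: entries are ext4_extent_idx, not extents")
    extents = []
    for i in range(entries):           # only eh_entries slots are valid
        ee_block, ee_len, hi, lo = EXTENT.unpack_from(
            i_block, HEADER.size + i * EXTENT.size)
        extents.append((ee_block, ee_len, (hi << 32) | lo))
    return extents
```

Fed the bytes at 0x428-0x463 of the dump, it returns a single extent (0, 10, 255045), i.e. blocks 255045 through 255054.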

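1.3.2. The indirect (indexed) case

When ext4 needs more than 4 extents, it builds a tree (a B-tree) on disk to hold the extent data; this is what the "tree depth" field (eh_depth) in the extent header expresses.
The leaf nodes at the bottom of the tree hold regular extent structures (ext4_extent), as shown in the previous section. The interior nodes hold a different structure, the extent index (ext4_extent_idx).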

digraph G {
    rankdir=LR
    node [shape=record];

    ext4_inode [label="<0> ext4_inode|
      ...|
      <1> i_size_lo 0x4| 
      <2> i_block[EXT4_N_BLOCKS] 0x28|
      ..."];
    
    ext4_extent [label="<0> ext4_extent|
      =======|
      <1> ee_block 4|
      <2> ee_len 2| 
      <3> ee_start_hi 2|
      <4> ee_start_lo 4
      "];

    ext4_extent_header [label="<0> ext4_extent_header|
      =======|
      <1> eh_magic 2|
      <2> eh_entries 2| 
      <3> eh_max 2|
      <4> eh_depth 2 |
      <5> eh_generation 4
      "];

    ext4_extent_idx [label="<0> ext4_extent_idx|
      =======|
      <1> ei_block 4|
      <2> ei_leaf_lo 4| 
      <3> ei_leaf_hi 2|
      <4> ei_unused 2
      "];
    ext4_inode:2 -> ext4_extent_header:0
    ext4_extent_header:2 -> ext4_extent:0
    ext4_extent_header:4 -> ext4_extent_idx:0
    ext4_extent_idx:2 -> ext4_extent:0
}

begonia:/storage/emulated/0/downloaded_rom # stat miui_BEGONIA_9.12.25_8e8556132d_10.0.zip                                                                                                                                                                                 
  File: miui_BEGONIA_9.12.25_8e8556132d_10.0.zip
  Size: 2128126848	 Blocks: 4156520	 IO Blocks: 512	regular file
Device: 20h/32d	 Inode: 2728207	 Links: 1
Access: (0660/-rw-rw----)	Uid: (    0/    root)	Gid: ( 1015/sdcard_rw)
Access: 2019-12-25 16:24:52.784175650 +0800
Modify: 2019-12-25 16:28:05.404175662 +0800
Change: 2019-12-25 16:28:05.404175662 +0800

Inodes per group:         8192
Inode blocks per group:   512
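(2728207 - 1) / 8192 = 333          # group number
2728207 - 333*8192 = 2728207 - 2727936 = 271   # the 271st inode of group 333
271 = 16*16 + 15                    # skip 16 inode-table blocks; 15th 256-byte slot of the next block (offset 0xe00)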

Group 333: (Blocks 10911744-10944511) csum 0xed26 [ITABLE_ZEROED]
  Block bitmap at 10911744 (+0)
  Inode bitmap at 10911745 (+1)
  Inode table at 10911746-10912257 (+2)
  31180 free blocks, 7861 free inodes, 185 directories, 7857 unused inodes
  Free blocks: 10913304-10913479, 10913483, 10913495-10913775, 10913790-10944511
  Free inodes: 2728202, 2728264-2728265, 2728271-2736128

10911746 + 16 = 10911762

# inode 2728207: the 256-byte slot at 0xe00 of block 10911762 (ext4_extent_header at 0xe28)
00000e00: b481 ff03 80a3 d87e d41c 035e 951d 035e  .......~...^...^
00000e10: 951d 035e 0000 0000 ff03 0100 686c 3f00  ...^........hl?.
00000e20: 0008 0800 0100 0000 0af3 0100 0400 0100  ................
00000e30: 0000 0000 0000 0000 1686 a600 0000 6e00  ..............n.
00000e40: 6053 0000 2902 0000 60f3 6e00 1056 0000  `S..)...`.n..V..
00000e50: b103 0000 10f6 6e00 205e 0000 8901 0000  ......n. ^......
00000e60: 20fe 6e00 f9eb 060d 0e82 a600 0000 0000   .n.............
00000e70: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000e80: 2000 0000 b8ec 5c60 b8ec 5c60 8848 f6ba   .....\`..\`.H..
00000e90: d41c 035e 8848 f6ba 0000 0000 0000 0000  ...^.H..........
00000ea0: 0000 02ea 0109 4000 0000 0000 1c00 0000  ......@.........
00000eb0: 0000 0000 6300 0000 0000 0000 0000 0000  ....c...........
00000ec0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000ed0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000ee0: 0000 0000 017f 0400 122e 4f7e 6754 5b11  ..........O~gT[.
00000ef0: 4101 d984 1c36 1eac 9a7d 4123 9edb 6677  A....6...}A#..fw

# 0x7ed8a380 = 2128126848, which confirms this is the right file

00000e20: 0008 0800 0100 0000 || 0af3 0100 0400 0100  ................
1 entry, at most 4 entries, depth 1
# ext4_extent_idx
0000 0000 1686 a600 0000 6e00
# points to block 0xa68616 = 10913302
# block 10913302: eh_entries 0x14, eh_max 0x0154 = 340; 340*12 = 4080, plus the 12-byte extent_header gives 4092, which fits within 4096, so the extent entries can fill an entire 4k block
00000000  0a f3 14 00 54 01 00 00  00 00 00 00 ||  00 00 00 00

# ext4 extent block
00000000  0a f3 14 00 54 01 00 00  00 00 00 00 00 00 00 00  |....T...........|
00000010  00 60 00 00 00 a0 6e 00  00 60 00 00 00 28 00 00  |.`....n..`...(..|
00000020  00 08 6f 00 00 88 00 00  00 18 00 00 00 98 6d 00  |..o...........m.|
00000030  00 a0 00 00 00 48 00 00  00 b8 6d 00 00 e8 00 00  |.....H....m.....|
00000040  00 48 00 00 00 38 6e 00  00 30 01 00 00 50 00 00  |.H...8n..0...P..|
00000050  00 30 6f 00 00 80 01 00  00 78 00 00 00 88 6f 00  |.0o......x....o.|
00000060  00 f8 01 00 00 78 00 00  00 08 70 00 00 70 02 00  |.....x....p..p..|
00000070  00 78 00 00 00 88 70 00  00 e8 02 00 00 78 00 00  |.x....p......x..|
00000080  00 08 71 00 00 60 03 00  00 78 00 00 00 88 71 00  |..q..`...x....q.|
00000090  00 d8 03 00 00 78 00 00  00 08 72 00 00 50 04 00  |.....x....r..P..|
000000a0  00 78 00 00 00 88 72 00  00 c8 04 00 00 78 00 00  |.x....r......x..|
000000b0  00 08 73 00 00 40 05 00  00 78 00 00 00 88 73 00  |..s..@...x....s.|
000000c0  00 b8 05 00 00 78 00 00  00 08 74 00 00 30 06 00  |.....x....t..0..|
000000d0  00 78 00 00 00 88 74 00  00 a8 06 00 00 78 00 00  |.x....t......x..|
000000e0  00 08 75 00 00 20 07 00  00 78 00 00 00 88 75 00  |..u.. ...x....u.|
000000f0  00 98 07 00 8b 55 00 00  00 08 76 00 00 e9 07 00  |.....U....v.....|
00000100  8b 04 00 00 00 59 76 00  00 e9 07 00 8b 04 00 00  |.....Yv.........|
00000110  00 59 76 00 00 e9 07 00  8b 04 00 00 00 59 76 00  |.Yv..........Yv.|
00000120  00 e9 07 00 c1 03 00 00  00 59 76 00 00 00 00 00  |.........Yv.....|
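# the first extent
00000000  ................................ ||  00 00 00 00  |....T...........|
00000010  00 60 00 00 00 a0 6e 00 || ...


# 0x6000 = 24576 contiguous blocks starting at 0x6ea000 = 7249920; block.map confirms this
#  cat /cache/recovery/block.map 
7249920 7274496

# the second extent
00000010  .... || 00 60 00 00 00 28 00 00  |.`....n..`...(..|
00000020  00 08 6f 00 || ...  |..o...........m.|

# the leading 0x6000 is logical block 24576, which follows directly on the first extent. 0x2800: this extent covers 10240 blocks, starting at 0x6f0800 = 7276544
#  cat /cache/recovery/block.map 
7276544 7286784

# the third
00 88 00 00  00 18 00 00 00 98 6d 00
# ee_block 34816, ee_len 6144, start 7182336


# skipping ahead to the last one
00 98 07 00 8b 55 00 00  00 08 76 00
# ee_block 0x079800 = 497664, ee_len 0x558b = 21899, start 0x760800 = 7735296
(497664 + 21899) * 4096 = 2128130048; the file size 2128126848 falls inside the last block (between 497664 + 21898 and 497664 + 21899 blocks), so the last block is not completely filled

# note that the header declares 20 entries: 12 + 20*12 = 252 bytes, i.e. offsets 0x00-0xfb (0-251); whatever follows 0xfb in the dump is not a valid entry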
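The walk from the index entry to the leaf can be sketched the same way. This is an illustration with hypothetical helper names, assuming the little-endian ext4_extent_idx and ext4_extent layouts shown earlier; the leaf parser decodes only as many entries as eh_entries declares and as the supplied bytes contain, and it does not handle uninitialized extents.

```python
import struct

HEADER = struct.Struct("<HHHHI")   # eh_magic, eh_entries, eh_max, eh_depth, eh_generation
EXTENT = struct.Struct("<IHHI")    # ee_block, ee_len, ee_start_hi, ee_start_lo
INDEX = struct.Struct("<IIHH")     # ei_block, ei_leaf_lo, ei_leaf_hi, ei_unused


def index_target(raw):
    """Return (first logical block covered, physical block of the child node)."""
    ei_block, leaf_lo, leaf_hi, _unused = INDEX.unpack(raw)
    return ei_block, (leaf_hi << 32) | leaf_lo


def leaf_extents(node):
    """Decode (logical, length, physical) tuples from a leaf node's bytes."""
    magic, entries, _max, depth, _gen = HEADER.unpack_from(node, 0)
    if magic != 0xF30A or depth != 0:
        raise ValueError("not an extent leaf node")
    available = (len(node) - HEADER.size) // EXTENT.size
    result = []
    for i in range(min(entries, available)):
        blk, length, hi, lo = EXTENT.unpack_from(node, HEADER.size + i * EXTENT.size)
        result.append((blk, length, (hi << 32) | lo))
    return result


def logically_contiguous(extents):
    """True if each extent starts where the previous one ends (no holes)."""
    return all(a[0] + a[1] == b[0] for a, b in zip(extents, extents[1:]))
```

With the 12 index bytes from the inode, index_target yields block 10913302; the first 48 bytes of that block decode to the three extents examined by hand above.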

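1.3.2.1. Summary

In summary, the extent tree of the OTA package has depth 1: the inode holds one extent_idx, which points to a single extent block; that block describes 0x14 extents, which together cover all of the file's data blocks.
When parsing an extent_header, pay attention to its entry count: eh_entries tells which of the following ext4_extent slots are valid and which are not.

Layout of a 4k extent block in the indexed case: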

digraph G {
    rankdir=TB
    node [shape=record];
    ext4_extent_block [label="<0> ext4_extent_header 12|
      <1> ext4_extent 12|
      <3> ... |
      <2> ext4_extent 12| 
      <4> tail 4
      "];
}

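Here the extent block holds 0x14 = 20 entries. Since 20 > 4 they do not fit in the inode, which is why a B-tree was built, with one index level (depth 1). Besides using eh_entries to decide, top to bottom, which entries are valid, also pay attention to
the ee_len member of an extent: a value <= 32768 means the extent is initialized, while a value > 32768 means the extent is not yet initialized.

A 32-bit checksum can be stored in the tail of the extent block: exactly 4 bytes, which fills the 4k block completely.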
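Both rules are easy to check numerically. A short sketch with hypothetical helper names; the 12-byte header, 12-byte entry and 4-byte tail sizes come from the notes, and the assumption that an uninitialized extent's real length is ee_len minus 32768 follows the kernel's EXT_INIT_MAX_LEN convention.

```python
EXT_INIT_MAX_LEN = 1 << 15   # 32768


def leaf_capacity(block_size, header_size=12, entry_size=12, tail_size=4):
    """Number of extent entries that fit in one extent block."""
    return (block_size - header_size - tail_size) // entry_size


def decode_ee_len(raw_len):
    """Return (actual length in blocks, initialized?) for a raw ee_len."""
    if raw_len <= EXT_INIT_MAX_LEN:
        return raw_len, True
    # Assumption: lengths above 32768 encode an uninitialized extent.
    return raw_len - EXT_INIT_MAX_LEN, False
```

leaf_capacity(4096) evaluates to 340, the eh_max value 0x0154 read from the extent block header earlier.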

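1.4. The read call path

Open questions:
In the open path of the previous section, no memory allocation for the inode's i_mapping was found. How do buffer_head, the page cache and inode/dentry relate to one another?

With that question in mind, let us look at the read call path.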

+- SYSCALL_DEFINE3 <read, unsigned int, fd, char __user *, buf, size_t, count> #传入参数 fd, buf, count
    \ +- ksys_read(fd, buf, count)
        \ - fdget_pos(fd)  # "__fget_light -> fcheck_files(files, fd) current->files->fdt->fd[fd]  #fd的flags 与引用计数有关, FDPUT_FPUT |  FDPUT_POS_UNLOCK(open带FMODE_ATOMIC_POS参数时保证原子性)                                                                        
                             #[read分析](https://ithelp.ithome.com.tw/articles/10184552)"
        | - pos = file_pos_read <f.file>  #获取文件读位置的偏移量 要读/写的offset
        | +- vfs_read <f.file, buf, count, &pos>
        |    \ - rw_verify_area(READ, file, pos, count)  #检测文件部分是否有冲突的强制锁
        |    |+- __vfs_read(file, buf, count, pos)
        |    |    \ +- "file->f_op->read" -  "file->f_op->read(file, buf, count, pos)" #如果有注册了f_op中read方法, 使用read函数  file_operations ext4_file_operations
        |    |    | +- "file->f_op->read_iter" - new_sync_read(file, buf, count, pos) # 注册了fop中的read_iter方法, 使用new_sync_read方法
        |    |    |    \ - init_sync_kiocb(&kiocb, filp) #ext4 只注册了 read_iter方法, 根据记录即将进行I/O操作的完成状态, 将file 的字段部分封装到了kiocb中
        |    |    |    | - iov_iter_init(&iter, READ, &iov, 1, len) #来初始化iov_iter, kiocb表示io control block, 用来跟踪记录IO操作的完成状态, iov_iter iov_iter用来从用户和内核之间传递数据用 
        |    |    |    |+- call_read_iter(filp, &kiocb, &iter)  # file->f_op->read_iter(kio, iter)
        |    |            \ +- ext4_file_read_iter <struct kiocb *iocb, struct iov_iter *iter>
        |    |               \ - ext4_forced_shutdown "<EXT4_SB(file_inode(iocb->ki_filp)->i_sb))>" #    检测下超级块相关flag中的EXT4_FLAGS_SHUTDOWN位, 如果是SHUTDOWN, 返回EIO 错误
        |    |               | - IS_DAX "(file_inode(iocb->ki_filp))" # 判断内核是否配置了`CONFIG_FS_DAX`(Direct access),以及文件的打开方式是否是直接访问设备，这个直接影响访问是否绕过pagecache.
                                                     #如果配置了CONFIG_FS_DAX ，且文件打开方式指定了直接访问，那么则调用`ext4_dax_read_iter` 函数。否则调用generic_file_read_iter函数. 需要inode支持直接访问
             |               | +- generic_file_read_iter(iocb, iter) <filemap.c >
             |                  \ +- "iocb->ki_flags & IOCB_DIRECT)" 
             |                      \ - filemap_write_and_wait_range(mapping, iocb->ki_pos, iocb->ki_pos + count - 1) #判断上次写操作是否需要 filemap_write_and_wait_range函数同步,确保读到的数据是最新的
             |                      |+- "mapping->a_ops->direct_IO(iocb, iter)" # 直接访问有关, 直读模式, 如果目标文件定义了O_DIRECT标志，则直接跳过缓冲层，
                                        #使用generic_file_direct_IO函数将读请求直接传递到块设备驱动层，没有定义则调用 generic_file_buffered_read, 
                                        # "[ext4_read调用流程](https://zhuanlan.zhihu.com/p/36897326)"
             |                       | +- blkdev_direct_IO <struct kiocb *iocb, struct iov_iter *iter> 
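             |                           \ - __blkdev_direct_IO <struct kiocb *iocb, struct iov_iter *iter, int nr_pages> # bio request handling
             |                  | +- "buffered (non-direct) mode"- generic_file_buffered_read <iocb, iter, retval> # goes through the page cache: loops over the requested range checking whether each page is cached, using index and offset to walk the inode's pages
             |                       \ - find_get_page(mapping, index) # look the page up in the page-cache tree. index is the page number derived from the file position (index = *ppos >> PAGE_SHIFT); mapping is the inode's i_mapping
             |                       |+- "page miss"
             |                           \ +- page_cache_sync_readahead(mapping, ra, filp, index, last_index - index);  # reads the page from disk and performs readahead.
                                                            # The page must also be checked for being up to date, to avoid reading stale data;
                                                            # if it is not, the readpage function in address_space_operations is called to fetch the current contents. Every page-read function eventually calls submit_bio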
             |                           | - find_get_page(mapping, index)               
             |                           | +- "page miss"
             |                               \ +-"page still not in the page cache"-  page = page_cache_alloc(mapping) # allocate a page-cache page
             |                                   \ - alloc_pages(gfp, 0) # 
             |                               | - add_to_page_cache_lru(page, mapping, index,) # put the page on the LRU and bind the newly allocated page to the mapping: page->mapping points to mapping,
                                                         # and the page is inserted into the mapping's tree
             |                               | +- goto readpage
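                                                 \ +- "mapping->a_ops->readpage(filp, page)" # call the address_space readpage function
                                                     \ +- blkdev_readpage # reached when the mapping belongs to a block device, whose def_blk_aops supplies blkdev_readpage
                                                         \+- blkdev_readpage # reads from disk into the page cache; a thin wrapper around the generic read-page helper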
                                                             \+- block_read_full_page(page, blkdev_get_block) #Generic "read page" function for block devices 
                                                                     #that have the normal get_block functionality.
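                                                                  \- head = create_page_buffers(page, inode, 0) # create the page's buffer_heads
                                                                  |- lblock = (i_size_read(inode)+blocksize-1) >> bbits; # one past the last block of the file, computed from i_size, i.e. the file's total number of blocks (blocks of 1 << bbits bytes, not 512-byte sectors)
                                                                  |- iblock = (sector_t)page->index << (PAGE_SHIFT - bbits); # the inode owns a page tree holding many pages, and each page carries one or more
                                                                                     # bh (block size from 512 bytes to 4 KiB); iblock is the file-relative index of the first block covered by this page
                                                                  |- get_block(inode, iblock, bh, 0); # calls the blkdev_get_block passed in to map block iblock onto bh (no data is read yet): bh->b_blocknr = iblock;
                                                                                                      # b_blocknr is the number of the block on b_bdev that the buffer is associated with
                                                                  |+- submit_bh(READ, bh); # buffers that are already buffer_uptodate(bh) are skipped; buffers that are mapped to a block but not up to date are collected in a temporary array,
                                                                                  # then the loop for (i = 0; i < nr; i++) submit_bh(READ, bh) hands every buffer that must be read to the bio layer and starts the reads
                                                                     \ +- submit_bh(REQ_OP_READ, 0, bh);   submit_bh_wbc # initialize a bio from the bh and its page; the buffer_head acts as the intermediary between page and bio
                                                                         \ - submit_bio(bio); # hand the bio to the block layer and start the read
             |                                       | +-"ext4 read page"- ext4_readpage # a real filesystem usually supplies its own readpage instead of the generic one, because its on-disk data has its own layout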
                                                         \ - ext4_mpage_readpages(page->mapping, NULL, page, 1, false)                                                     
                                                 | - PageUptodate  # a page in the cache does not necessarily hold current data; PageUptodate(page) checks this, and if the page is not up to date the find-page flow is repeated.
                                                 | - goto page_ok - # the cached data is up to date: enter the page_ok stage
                                                     \ - flush_dcache_page(page) # handle cache aliasing [cachetlb](https://www.kernel.org/doc/Documentation/cachetlb.txt)
                                                     | +-copy_page_to_iter(page, offset, nr, iter) # copy the data from the page to user space <iov_iter.c>
                                                         \ +- copy_page_to_iter_iovec(page, offset, bytes, i)
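                                                             \ - iov = i->iov; buf = iov->iov_base; kaddr = kmap(page); from = kaddr + offset; copyout(buf, from, copy); # copy from "from" into buf, which
                                                                                 # is the user buffer described by the iov_iter
             | - fsnotify_access  # inotify file-access event
        | -  file_pos_write <f.file, pos> # store the new offset so the next read continues from it; lseek can change it, and a fresh open after close starts again at 0
        | -  fdput_pos(f) # drop the file reference (refcount - 1)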
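
The index arithmetic that appears in the comments above (index = *ppos >> PAGE_SHIFT, iblock, lblock) can be tried out in user space. The following is only a sketch under stated assumptions: a 4 KiB page is assumed, and every name prefixed with SKETCH_ or ending in a helper-style suffix is hypothetical and not a kernel symbol.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for illustration only: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

/* Page-cache index of the page that contains file position pos. */
static inline uint64_t pos_to_index(uint64_t pos)
{
    return pos >> SKETCH_PAGE_SHIFT;
}

/* Byte offset of pos inside that page. */
static inline uint64_t pos_to_offset(uint64_t pos)
{
    return pos & (SKETCH_PAGE_SIZE - 1);
}

/* First file-relative block covered by page `index`, for blocks of
 * (1 << bbits) bytes. Mirrors: iblock = index << (PAGE_SHIFT - bbits). */
static inline uint64_t index_to_iblock(uint64_t index, unsigned int bbits)
{
    assert(bbits <= SKETCH_PAGE_SHIFT); /* a block is never larger than a page */
    return index << (SKETCH_PAGE_SHIFT - bbits);
}

/* Number of blocks needed to hold i_size bytes, i.e. one past the last
 * valid block. Mirrors: lblock = (i_size + blocksize - 1) >> bbits. */
static inline uint64_t size_to_lblock(uint64_t i_size, unsigned int bbits)
{
    uint64_t blocksize = 1ULL << bbits;

    return (i_size + blocksize - 1) >> bbits;
}
```

With 1 KiB blocks (bbits = 10), each page covers four blocks, so page index 2 starts at block 8; a block whose number is at or beyond lblock lies past the end of the file.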

struct bvec_iter {
	sector_t		bi_sector;	/* device address in 512 byte  sectors */
	unsigned int		bi_size;	/* residual I/O count */
	unsigned int		bi_idx;		/* current index into bvl_vec */
	unsigned int       bi_done;	/* number of bytes completed */
	unsigned int       bi_bvec_done;	/* number of bytes completed in current bvec */
	u64			bi_dun;		/* DUN setting for bio */
};

struct bio {   // a bio works directly with pages and address spaces, not with buffers
	struct bio		*bi_next;	/* request queue link */
	struct gendisk		*bi_disk;
	unsigned int		bi_opf;		/* bottom bits req flags,
						 * top bits REQ_OP. Use
						 * accessors.
						 */
	unsigned short		bi_flags;	/* status, etc and bvec pool number */
	unsigned short		bi_ioprio;
	unsigned short		bi_write_hint;
	blk_status_t		bi_status;
	u8			bi_partno;

	/* Number of segments in this BIO after
	 * physical address coalescing is performed.
	 */
	unsigned int		bi_phys_segments;

	/*
	 * To keep track of the max segment size, we account for the
	 * sizes of the first and last mergeable segments in this bio.
	 */
	unsigned int		bi_seg_front_size;
	unsigned int		bi_seg_back_size;

	struct bvec_iter	bi_iter;

	atomic_t		__bi_remaining;
	bio_end_io_t		*bi_end_io;

	void			*bi_private;
#ifdef CONFIG_BLK_CGROUP
	/*
	 * Optional ioc and css associated with this bio.  Put on bio
	 * release.  Read comment on top of bio_associate_current().
	 */
	struct io_context	*bi_ioc;
	struct cgroup_subsys_state *bi_css;
	struct blkcg_gq		*bi_blkg;
	struct bio_issue	bi_issue;
#endif
	union {
#if defined(CONFIG_BLK_DEV_INTEGRITY)
		struct bio_integrity_payload *bi_integrity; /* data integrity */
#endif
	};
#ifdef CONFIG_PFK
	/* Encryption key to use (NULL if none) */
	const struct blk_encryption_key	*bi_crypt_key;
#endif
#ifdef CONFIG_DM_DEFAULT_KEY
	int bi_crypt_skip;
#endif

	unsigned short		bi_vcnt;	/* how many bio_vec's */

	/*
	 * Everything starting with bi_max_vecs will be preserved by bio_reset()
	 */

	unsigned short		bi_max_vecs;	/* max bvl_vecs we can hold */

	atomic_t		__bi_cnt;	/* pin count */

	struct bio_vec		*bi_io_vec;	/* the actual vec list */

	struct bio_set		*bi_pool;
#ifdef CONFIG_PFK
	struct inode		*bi_dio_inode;
#endif
	/*
	 * We can inline a number of vecs at the end of the bio, to avoid
	 * double allocations for a small number of bio_vecs. This member
	 * MUST obviously be kept at the very end of the bio.
	 */
	struct bio_vec		bi_inline_vecs[0];
};

struct buffer_head {   // buffer_head.h: represents a disk block held in memory
	unsigned long b_state;		/* buffer state bitmap (see above): the state of this buffer */
	struct buffer_head *b_this_page;/* circular list of page's buffers */
	struct page *b_page;		/* the page this bh is mapped to */
	sector_t b_blocknr;		/* start block number: where the buffer's block begins on b_bdev */
	size_t b_size;			/* size of mapping */
	char *b_data;			/* pointer to data within the page (b_page); the block is b_size bytes long */
	struct block_device *b_bdev;	/* the associated block device */
	bh_end_io_t *b_end_io;		/* I/O completion */
	void *b_private;		/* reserved for b_end_io */
	struct list_head b_assoc_buffers; /* associated with another mapping */
	struct address_space *b_assoc_map;	/* mapping this buffer is
						   associated with */
	atomic_t b_count;		/* users using this buffer_head: a reference count, raised and lowered atomically by get_bh() and put_bh() */
};
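
The b_this_page member links all buffers of one page into a circular list, and the read path above walks that list to pick out the buffers that still have to be read. A minimal user-space sketch of that walk follows. It assumes a stripped-down stand-in type (struct model_bh and collect_stale are hypothetical names, not kernel code) and models the up-to-date state bit as a plain int.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct buffer_head; only the fields used here. */
struct model_bh {
    int uptodate;                  /* models the "up to date" state bit */
    unsigned long b_blocknr;       /* block number on the device */
    struct model_bh *b_this_page;  /* circular list of the page's buffers */
};

/*
 * Walk the circular list starting at head and collect every buffer that is
 * not up to date into arr (at most max entries). Returns how many were
 * collected. A real kernel would then submit each collected buffer for I/O.
 */
static int collect_stale(struct model_bh *head, struct model_bh **arr, int max)
{
    struct model_bh *bh = head;
    int nr = 0;

    assert(head != NULL);
    do {
        if (!bh->uptodate && nr < max)
            arr[nr++] = bh;
        bh = bh->b_this_page;
    } while (bh != head);       /* the list is circular: stop at the start */

    return nr;
}
```

The do/while form matters: because the list is circular and never NULL-terminated, the loop must visit head once before comparing against it.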

struct page {   //mm_types.h
	unsigned long flags;		/* Atomic flags, some possibly
					 * updated asynchronously */
	/*
	 * Five words (20/40 bytes) are available in this union.
	 * WARNING: bit 0 of the first word is used for PageTail(). That
	 * means the other users of this union MUST NOT use the bit to
	 * avoid collision and false-positive PageTail().
	 */
	union {
		struct {	/* Page cache and anonymous pages */
			/**
			 * @lru: Pageout list, eg. active_list protected by
			 * zone_lru_lock.  Sometimes used as a generic list
			 * by the page owner.
			 */
			struct list_head lru;
			/* See page-flags.h for PAGE_MAPPING_FLAGS */
			struct address_space *mapping;
			pgoff_t index;		/* Our offset within mapping. */
			/**
			 * @private: Mapping-private opaque data.
			 * Usually used for buffer_heads if PagePrivate.
			 * Used for swp_entry_t if PageSwapCache.
			 * Indicates order in the buddy system if PageBuddy.
			 */
			unsigned long private;
		};
		struct {	/* slab, slob and slub */
			union {
				struct list_head slab_list;	/* uses lru */
				struct {	/* Partial pages */
					struct page *next;
					int pages;	/* Nr of pages left */
					int pobjects;	/* Approximate count */
				};
			};
			struct kmem_cache *slab_cache; /* not slob */
			/* Double-word boundary */
			void *freelist;		/* first free object */
			union {
				void *s_mem;	/* slab: first object */
				unsigned long counters;		/* SLUB */
				struct {			/* SLUB */
					unsigned inuse:16;
					unsigned objects:15;
					unsigned frozen:1;
				};
			};
		};
		struct {	/* Tail pages of compound page */
			unsigned long compound_head;	/* Bit zero is set */

			/* First tail page only */
			unsigned char compound_dtor;
			unsigned char compound_order;
			atomic_t compound_mapcount;
		};
		struct {	/* Second tail page of compound page */
			unsigned long _compound_pad_1;	/* compound_head */
			unsigned long _compound_pad_2;
			struct list_head deferred_list;
		};
		struct {	/* Page table pages */
			unsigned long _pt_pad_1;	/* compound_head */
			pgtable_t pmd_huge_pte; /* protected by page->ptl */
			unsigned long _pt_pad_2;	/* mapping */
			union {
				struct mm_struct *pt_mm; /* x86 pgds only */
				atomic_t pt_frag_refcount; /* powerpc */
			};
#if ALLOC_SPLIT_PTLOCKS
			spinlock_t *ptl;
#else
			spinlock_t ptl;
#endif
		};
		struct {	/* ZONE_DEVICE pages */
			/** @pgmap: Points to the hosting device page map. */
			struct dev_pagemap *pgmap;
			unsigned long hmm_data;
			unsigned long _zd_pad_1;	/* uses mapping */
		};

		/** @rcu_head: You can use this to free a page by RCU. */
		struct rcu_head rcu_head;
	};

	union {		/* This union is 4 bytes in size. */
		/*
		 * If the page can be mapped to userspace, encodes the number
		 * of times this page is referenced by a page table.
		 */
		atomic_t _mapcount;

		/*
		 * If the page is neither PageSlab nor mappable to userspace,
		 * the value stored here may help determine what this page
		 * is used for.  See page-flags.h for a list of page types
		 * which are currently stored here.
		 */
		unsigned int page_type;

		unsigned int active;		/* SLAB */
		int units;			/* SLOB */
	};

	/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
	atomic_t _refcount;
};

struct address_space {
	struct inode		*host;		/* owner: inode, block_device */
	struct radix_tree_root	i_pages;	/* cached pages */
	atomic_t		i_mmap_writable;/* count VM_SHARED mappings */
	struct rb_root_cached	i_mmap;		/* tree of private and shared mappings */
	struct rw_semaphore	i_mmap_rwsem;	/* protect tree, count, list */
	/* Protected by the i_pages lock */
	unsigned long		nrpages;	/* number of total pages */
	/* number of shadow or DAX exceptional entries */
	unsigned long		nrexceptional;
	pgoff_t			writeback_index;/* writeback starts here */
	const struct address_space_operations *a_ops;	/* methods */
	unsigned long		flags;		/* error bits */
	spinlock_t		private_lock;	/* for use by the address_space */
	gfp_t			gfp_mask;	/* implicit gfp mask for allocations */
	struct list_head	private_list;	/* for use by the address_space */
	void			*private_data;	/* ditto */
	errseq_t		wb_err;
} __attribute__((aligned(sizeof(long)))) __randomize_layout;

struct block_device {
	dev_t			bd_dev;  /* not a kdev_t - it's a search key */
	int			bd_openers;
	struct inode *		bd_inode;	/* will die */
	struct super_block *	bd_super;
	struct mutex		bd_mutex;	/* open/close mutex */
	void *			bd_claiming;
	void *			bd_holder;
	int			bd_holders;
	bool			bd_write_holder;
#ifdef CONFIG_SYSFS
	struct list_head	bd_holder_disks;
#endif
	struct block_device *	bd_contains;
	unsigned		bd_block_size;
	u8			bd_partno;
	struct hd_struct *	bd_part;
	/* number of times partitions within this device have been opened. */
	unsigned		bd_part_count;
	int			bd_invalidated;
	struct gendisk *	bd_disk;
	struct request_queue *  bd_queue;
	struct backing_dev_info *bd_bdi;
	struct list_head	bd_list;
	/*
	 * Private data.  You must have bd_claim'ed the block_device
	 * to use this.  NOTE:  bd_claim allows an owner to claim
	 * the same device multiple times, the owner must take special
	 * care to not mess up bd_private for that case.
	 */
	unsigned long		bd_private;

	/* The counter of freeze processes */
	int			bd_fsfreeze_count;
	/* Mutex for freeze */
	struct mutex		bd_fsfreeze_mutex;
} __randomize_layout;

struct kiocb {
	struct file		*ki_filp;
	loff_t			ki_pos;
	void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
	void			*private;
	int			ki_flags;
	u16			ki_hint;
	u16			ki_ioprio; /* See linux/ioprio.h */
};
struct bio_vec {
	struct page	   *bv_page;    /* page that holds the block */
	unsigned int	bv_len;     /* length of the block */
	unsigned int	bv_offset;  /* offset of the block within the page */
};

1.4.1. Relationship between bio, buffer_head, page and inode

digraph G {
    rankdir=LR
    node [shape=record];

    buffer_head [label=" <0> buffer_head|
      ======|
      <1>page* b_page|
      <2> *b_this_page|
      <3> *b_bdev |
      <4> *b_end_io |
      <5> *b_private |
      <6> *b_assoc_map|
      <7> b_blocknr |
      <8> b_size |
      ..."]


    bio [label=" <0> bio|
      ======|
      <1> *bi_next|
      <2> *bi_disk|
      <3> *b_bdev |
      <4> *bi_end_io |
      <5> *bi_private |
      <6> *b_assoc_map|
      <7> *bi_crypt_key |
      <8> *bi_io_vec |
      <9> bvec_iter	bi_iter |
      <10> bi_vcnt|
      <11> bi_idx |
      ..."]

    page [label=" <0> page|
      ======|
      <1> *mapping|
      <2> list_head slab_list|
      <3> page *next |
      <4> pages |
      <5> kmem_cache *   slab_cache |
      <6> dev_pagemap * pgmap|
      <7> rcu_head |
      page_type |
      active |
      ..."]


    address_space[label="<0> address_space|
      =======|
      <1>  inode *host|
      radix_tree_root	 i_pages|
      nrpages|
      <2> address_space_operations *a_ops |
      ...
      "]

    inode [label="<0> inode|
      ======|
      i_mode|
      <1> i_mapping|
      ...|
      <2> i_sb
      "]

    bio_vec [label="<0> bio_vec|
      ======|
      <1> *bv_page|
      <2> bv_len|
      <3> bv_offset |
      ...
      "]

    bvec_iter [label="<0> bvec_iter|
      ======|
      <1> bi_sector|
      <2> bi_size|
      <3> bi_done |
      ...
      "]

    inode:1 -> address_space:0
    address_space:1 -> inode:0
    page:1 -> address_space:0
    page:1 -> inode:1
    page:3 -> page:0 [style=dotted]
    bio:1 -> bio:0 [style=dotted, color=blue]
    buffer_head:1 -> page:0
    bio:8 -> bio_vec:0 
    bio_vec:1 -> page:0
    bio:9 -> bvec_iter:0
    bvec_iter:1 -> buffer_head:7 [style=dotted, color=blue]
    bvec_iter:1 -> buffer_head:8 [style=dotted, color=blue]
    bio:3 -> buffer_head:3 [style=dotted, color=green]
    bio:5 -> buffer_head:0 [style=dotted color=blue]
    bio:10 -> bio:8 
    bio:11 -> bio:8 [headlabel="index of the\ncurrent I/O\nin the\nbi_io_vec array" color=green]
}

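The relationship between buffer_head and bio shows most clearly in submit_bh(). (Does this mean a buffer_head and a bio are only tied together when the blocks within a page are not contiguous?)
Source: "Linux kernel学习-block层"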

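When a block of a block device (generally a whole multiple of the sector size and no larger than one memory page) is held in memory as the result of a read, write or similar operation, it is said to be in a buffer. Each buffer is associated with one block and represents that disk block in memory. The kernel therefore needs control information describing the block data: each block has a descriptor, called the buffer head and represented by struct buffer_head.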

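Before Linux 2.6, buffer_head was a very important kernel data structure: it was the basic unit of I/O in the kernel (that role now belongs to the bio structure).
It described the mapping from a disk block to a physical page, and every block I/O operation was also carried in a buffer_head. This caused some significant problems:

The buffer_head structure was too large (it has since been slimmed down considerably), and manipulating I/O data through buffer heads was too complicated; the kernel prefers to work in terms of pages, which also performs better;
A large buffer_head was often used to describe a single buffer, and that buffer could well be smaller than a page, which was inefficient;
A buffer_head can describe only one buffer, so a large I/O operation was often split across many buffer_heads, which cost additional space.
So from 2.6 onward (it was actually introduced during the 2.5 development series) the kernel uses the bio structure, which deals directly with pages and address spaces rather than buffers.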

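The reference count of a bio, __bi_cnt, is raised and lowered with bio_get and bio_put.
The most important members of the bio structure are bi_vcnt, bi_idx and bi_io_vec. bi_vcnt is the number of bio_vec entries in the list that bi_io_vec points to. Each entry of bi_io_vec is one segment of the block I/O operation (familiar territory if you have used readv and writev). bi_idx is the current index into the bi_io_vec array (in the struct bio shown above it lives in bi_iter.bi_idx) and is updated as the block I/O operation progresses. The kernel provides the bio_for_each_segment macro for walking the bio_vecs of a bio. The MD software RAID driver in the kernel also uses bi_idx to distribute one bio request across different disk devices.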
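
For readers who have not used readv, the user-space call gives a feel for what an I/O vector is: one request, several separate memory segments. The sketch below is only an analogy for bi_io_vec, not kernel code. read_into_two and demo_scatter are hypothetical helper names, and a pipe is used purely so that the example needs no file on disk.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Read from fd with a single readv() call, scattering the data into two
 * separate buffers. Returns the number of bytes read, or -1 on error.
 */
static ssize_t read_into_two(int fd, char *a, size_t len_a,
                             char *b, size_t len_b)
{
    struct iovec iov[2];

    assert(a != NULL && b != NULL);
    iov[0].iov_base = a;    /* first segment */
    iov[0].iov_len  = len_a;
    iov[1].iov_base = b;    /* second segment, filled once the first is full */
    iov[1].iov_len  = len_b;

    return readv(fd, iov, 2);
}

/* Hypothetical demo helper: push msg through a pipe and read it back
 * scattered over a and b. Returns bytes read, or -1 on error. */
static int demo_scatter(const char *msg, char *a, size_t len_a,
                        char *b, size_t len_b)
{
    int fds[2];
    ssize_t n;
    size_t len = strlen(msg);

    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], msg, len) != (ssize_t)len) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    close(fds[1]);          /* no more data will arrive */
    n = read_into_two(fds[0], a, len_a, b, len_b);
    close(fds[0]);
    return (int)n;
}
```

Calling demo_scatter("ABCDEFG", a, 4, b, 8) fills the 4-byte buffer a first and places the remaining bytes in b, in the same way that a bio's data is spread over its bio_vec segments in order.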

Each bio_vec points to its page: bv_page is the page that holds the data, bv_offset is the block's offset within the page, and bv_len is the block's length.
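
The (page, offset, length) triple can be modelled in user space to show how a list of segments describes one logical transfer. This is a sketch under stated assumptions: struct model_page, struct model_bio_vec and gather_segments are hypothetical stand-ins, the page size is assumed to be 4096 bytes, and a plain loop takes the place of the kernel's bio_for_each_segment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u   /* assumed page size */

/* Simplified stand-ins for struct page and struct bio_vec. */
struct model_page {
    unsigned char data[MODEL_PAGE_SIZE];
};

struct model_bio_vec {
    struct model_page *bv_page;   /* page that holds the segment */
    unsigned int bv_len;          /* length of the segment in bytes */
    unsigned int bv_offset;       /* offset of the segment within the page */
};

/*
 * Copy every segment, in order, into dst. Returns the total number of bytes
 * copied, or 0 if dst is too small to hold all segments.
 */
static size_t gather_segments(const struct model_bio_vec *vec,
                              unsigned short vcnt,
                              unsigned char *dst, size_t dst_len)
{
    size_t done = 0;
    unsigned short idx;           /* plays the role of bi_idx */

    for (idx = 0; idx < vcnt; idx++) {
        const struct model_bio_vec *bv = &vec[idx];

        /* a segment never crosses the end of its page */
        assert(bv->bv_offset + bv->bv_len <= MODEL_PAGE_SIZE);
        if (bv->bv_len > dst_len - done)
            return 0;
        memcpy(dst + done, bv->bv_page->data + bv->bv_offset, bv->bv_len);
        done += bv->bv_len;
    }
    return done;
}
```

Because each segment names a page rather than a virtual address, the description stays valid even when the page has no permanent kernel mapping, which is the property the high-memory remark in the summary relies on.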

1.4.1.1. buffer_head and bio: summary

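As the above shows, block I/O requests are submitted and processed in the form of I/O vectors.
The advantages of bio over buffer_head are:

bio can use high memory more easily, because it deals only with pages and never uses addresses directly.
bio can represent direct I/O (which bypasses the page cache; described in detail later).
bio has better support for vectored I/O (including sg I/O), which keeps an I/O from being broken apart.
buffer_head is still needed, however: it maps disk blocks to memory, because bio lacks the buffer-state members and some other information the kernel needs.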