【CS130】Final Review
Cheatsheet-Final
4. Scheduling
4.0. Overview
Definition
Scheduling: deciding which threads are given access to resources from moment to moment.
- Often, we think in terms of CPU time, but could also think about access to resources like network BW or disk access.
Scheduling Assumptions
- One program per user
- One thread per program
- Programs are independent
Scheduling Policy Goals / Criteria
- Minimize Response Time
- Minimize elapsed time to do an operation / job
- Maximize Throughput
- Throughput related to response time, but not identical:
- Minimizing response time will lead to more context switching than if you only maximized throughput
- Two parts to maximizing throughput
- Minimize overhead
- Efficient use of resources
- Fairness
- Share CPU among users in some equitable way
- Programs use both the CPU and I/O devices. The OS's scheduling algorithm decides which job runs on the CPU next; round-robin scheduling ensures every thread gets some CPU time. Some I/O activity may count as computation, since the CPU is still in use; "I/O" here means the process enters the blocked state, waiting for an external device to finish its work.
Scheduling Algorithms
- First Come, First Served
- Round Robin
- Priority
- Multi Level Queue
- Real-Time Scheduling
4.1. FCFS Scheduling
Ideas
- First-Come, First Served
- Convoy Effect: short process stuck behind long process
Example
4.2. Round Robin Scheduling
Ideas
- Idea: Preemption
- Each process gets a small unit of CPU time (a time quantum), usually 10-100 ms
- After quantum expires, the process is preempted and added to the end of the ready queue
- With $n$ processes in the ready queue and time quantum $q$:
- Each process gets $\frac{1}{n}$ of the CPU time
- In chunks of at most $q$ time units
- No process waits more than $(n-1)\cdot q$ time units
- Performance
- $q$ large → FCFS
- $q$ small → interleaved (really small → hyperthreading)
- $q$ must be large with respect to context switch, otherwise overhead is too high.
- How to choose time slices
- What if too big?
- Response time suffers
- What if infinite?
- Get back to FCFS
- What if too small?
- Throughput suffers
- Actual choices:
- In practice, need to balance short job performance and long job throughput
- Typical time slice today is between 10 - 100 ms
- Typical context-switching overhead is 0.1 - 1 ms
- Roughly 1% overhead due to context-switching
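A quick numeric check of the arithmetic above, as a minimal C sketch (quantum and switch cost are the typical figures quoted above; the process count `n = 5` is a hypothetical value):

```c
/* Round-robin arithmetic sketch: overhead fraction and worst-case wait. */
#include <stdio.h>

int main(void) {
    double q  = 10.0;   /* time quantum, ms (typical value from above) */
    double cs = 0.1;    /* context-switch cost, ms */
    int    n  = 5;      /* hypothetical number of runnable processes */

    /* Each quantum of useful work costs one context switch on top. */
    printf("overhead = %.1f%%\n", 100.0 * cs / (q + cs));
    /* No process waits more than (n-1)*q between turns. */
    printf("max wait = %.1f ms\n", (n - 1) * q);
    return 0;
}
```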
Example
4.3. Priority Scheduling
- A priority value (integer) is associated with each process.
- Based on:
- Cost to user
- Importance to user
- Aging
- %CPU time used in last XX hours
- Execution Plan
- Always execute highest-priority runnable jobs to completion
- Each queue can be processed in RR with some time quantum
- Problem
- Starvation
- Deadlock
- How to fix problems
- Dynamic Priorities
- Adjust base-level priority up or down based on heuristics about interactivity, locking, burst behavior, etc.
- How to implement fairness?
- Give each queue some fraction of the CPU
- Increase priority of jobs that don’t get service
4.4. Lottery Scheduling
Definition
- Give each job some number of lottery tickets
- On each time slice, randomly pick a winning ticket
- On average, CPU time is proportional to number of tickets given to each job
Tickets Assignment
- To approximate SRTF: short-running jobs get more tickets, long-running jobs get fewer
- To avoid starvation, every job gets at least one ticket (a lottery draw is sketched below)
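A minimal sketch of a lottery draw, with made-up job names and ticket counts (10:5:1, so over many slices the short job gets about 62% of the CPU):

```c
/* Lottery scheduling sketch: pick a winning ticket uniformly at random
 * and find the job holding it. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct job { const char *name; int tickets; };

/* Returns index of the job owning the winning ticket. */
int draw(struct job *jobs, int njobs, int total_tickets) {
    int winner = rand() % total_tickets;  /* winning ticket number */
    for (int i = 0; i < njobs; i++) {
        if (winner < jobs[i].tickets)
            return i;                     /* ticket falls in this job's range */
        winner -= jobs[i].tickets;
    }
    return -1;                            /* unreachable if counts are right */
}

int main(void) {
    struct job jobs[] = { {"short", 10}, {"medium", 5}, {"long", 1} };
    srand((unsigned)time(NULL));
    /* On average, CPU share approaches tickets/total. */
    for (int slice = 0; slice < 5; slice++)
        printf("slice %d -> %s\n", slice, jobs[draw(jobs, 3, 16)].name);
    return 0;
}
```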
How to evaluate a scheduling algorithm?
- Deterministic modeling
- Queueing models
- Implementation / Simulation
4.6. What if We Knew the Future?
Algorithms
- Shortest Job First (SJF)
- Shortest Remaining Time First (SRTF)
- Sometimes called “Shortest Remaining Time to Completion First” (SRTCF)
Discussion
SJF / SRTF are the best you can do at minimizing average response time
Provably optimal!
SRTF is always at least as good as SJF, so we focus on it.
Comparison of SRTF with FCFS
- What if all jobs the same length?
- SRTF becomes the same as FCFS (i.e. FCFS is best can do if all jobs the same length)
- What if jobs have varying length?
- SRTF: short jobs not stuck behind long ones
Problems with SRTF
- Starvation
- SRTF can lead to starvation if many small jobs!
- Large jobs never run
- Somehow need to predict future
- Some systems ask the user
- SRTF Pros & Cons
- Pros: Optimal (average response time)
- Cons: Hard to predict future; unfair
Predicting the length of the next CPU burst
- Adaptive: Changing policy based on past behavior
- CPU scheduling, in virtual memory, in file systems, etc
- Works because programs have predictable behavior
- If program was I/O bound in past, likely in future
- If computer behavior were random, wouldn’t help
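One standard way to make such a prediction concrete is exponential averaging of past burst lengths, $\tau_{n+1} = \alpha t_n + (1-\alpha)\tau_n$. This is the classic textbook estimator rather than anything specific to this outline; `alpha`, the initial guess, and the sample bursts are made-up values:

```c
/* Adaptive burst prediction sketch using exponential averaging:
 *   tau_{n+1} = alpha * t_n + (1 - alpha) * tau_n
 * where t_n is the measured burst and tau_n the previous estimate. */
#include <stdio.h>

int main(void) {
    double alpha = 0.5;                   /* weight on most recent burst */
    double tau = 10.0;                    /* initial guess, ms */
    double bursts[] = {6, 4, 6, 4, 13, 13, 13};

    for (int i = 0; i < 7; i++) {
        printf("predicted %.2f ms, observed %.0f ms\n", tau, bursts[i]);
        tau = alpha * bursts[i] + (1 - alpha) * tau;  /* update estimate */
    }
    return 0;
}
```

Because programs have predictable behavior, the estimate tracks the shift from short to long bursts after a few samples.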
4.7. Multi-level Queue
- Ready queue partitioned into separate queues
- e.g. system processes, foreground (interactive), background (batch), student processes…
- Each queue has its own scheduling algorithm
- e.g. foreground(RR), background(FCFS)
- Processes assigned to one queue permanently
- Scheduling must be done between the queues
- Fixed priority
- Time slice: Each queue gets some CPU time that it schedules
Multilevel Feedback Queue
- Multilevel queue, each with different priorities
- Higher priority queues often considered “foreground” tasks
- Each queue has its own scheduling algorithm
- e.g., Foreground - RR, background - FCFS
- A process can move between the queues
- Aging can be implemented this way
- Parameters for a multilevel feedback queue scheduler
- Number of queues
- Scheduling algorithm for each queue
- Method used to determine
- When to upgrade a process
- When to demote a process
- Which queue a process will enter when that process needs service
- Scheduling must be done between the queues
- Fixed priority scheduling
- Time slice
- Countermeasure: user actions that can foil the intent of the OS designers
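A minimal sketch of the promotion/demotion rule: a job that burns its whole quantum is demoted, one that blocks early (interactive) is promoted. Queue count, quanta, and the job trace are made-up values:

```c
/* Multilevel feedback queue sketch. */
#include <stdio.h>

#define NQUEUES 3
int quantum[NQUEUES] = {10, 20, 40};   /* ms; lower queue = longer slice */

/* Returns the job's new queue level after one scheduling round. */
int requeue(int level, int used_ms, int blocked_early) {
    if (blocked_early && level > 0)
        return level - 1;              /* interactivity/aging: promote */
    if (used_ms >= quantum[level] && level < NQUEUES - 1)
        return level + 1;              /* burned whole quantum: demote */
    return level;
}

int main(void) {
    int level = 0;
    level = requeue(level, 10, 0);     /* CPU-bound round: demoted */
    printf("after round 1: queue %d\n", level);
    level = requeue(level, 3, 1);      /* blocked early: promoted */
    printf("after round 2: queue %d\n", level);
    return 0;
}
```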
Linux O(1) Scheduler
Priority-based scheduler: 140 priorities
- 40 for User Tasks
- 100 for Real-time/Kernel
- Lower priority value → Higher priority
- All algorithms $O(1)$: schedule n processes in constant time
- compute time-slices/priorities/interactivity credits when job finishes time slice
- 140-bit bit mask indicates presence or absence of job(s) at given priority level
Two separate priority queues (arrays)
- Active
- Expired
- All tasks in the active queue use up their time slices and get placed on the expired queue, after which the two queues are swapped
Time slice depends on priority - linearly mapped onto time slice range
- Like multi-level queue (one queue per priority) with different time slice at each level
- Execution split into “Timeslice Granularity” chunks — round-robin among tasks of equal priority between chunks
Heuristics
- User-task priority adjusted ±5 based on heuristics
- `p->sleep_avg = sleep_time - run_time`
- Higher `sleep_avg` → more I/O bound the task, more reward
- Interactive Credit
- Earned when task sleeps for a “long” time, spent when task runs for a “long” time
- IC is used to provide hysteresis to avoid changing interactivity for temporary changes in behavior
- BUT, interactive tasks get special dispensation
- To try to maintain interactivity
- Placed back into active queue, unless some other task has starved for too long.
Real-Time tasks
- Always preempt non-RT tasks and no dynamic adjustment of priorities
- Scheduling schemes:
- `SCHED_FIFO`: preempts other tasks, no time-slice limit
- `SCHED_RR`: preempts normal tasks, RR scheduling amongst tasks of same priority
4.8. Real-Time Scheduling
Efficiency is important but predictability is essential
- Hard real-time computing
- Attempt to meet all deadlines
- EDF (earliest deadline first), LLF (least laxity first)
- RMS (Rate-Monotonic Scheduling), DM (Deadline Monotonic Scheduling)
- Soft real-time computing
- Attempt to meet deadlines with high probability
- Minimize miss ratio / Maximize completion ratio
- CBS (Constant Bandwidth Server)
Issues in Real-Time Scheduling
- Dispatch Latency
- Priority Inversion and Inheritance
- Priority inversion: a higher-priority process needs a kernel resource currently held by a lower-priority process, so the higher-priority process must wait
- Priority inheritance (solution to priority inversion): the low-priority process temporarily inherits the high priority until it has finished using the resource in question.
4.9. EDF (Earliest Deadline First)
Tasks are periodic, with period $P_i$ and computation time $C_i$ in each period: $(P_i, C_i)$ for each task $i$
Preemptive priority-based dynamic scheduling:
- Each task is assigned a (current) priority based on how close its absolute deadline is (i.e. $D_i^{t+1} = D_i^t + P_i$ for each task!)
- The scheduler always schedules the active task with the closest absolute deadline
- Schedulable when $\sum_{i=1}^n(\frac{C_i}{P_i})\leq 1$
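A minimal sketch of this utilization test, with a made-up task set ($U = 1/4 + 2/5 + 2/10 = 0.85 \leq 1$, so schedulable):

```c
/* EDF schedulability sketch: a periodic task set (P_i, C_i) is
 * schedulable under EDF iff sum(C_i / P_i) <= 1. */
#include <stdio.h>

struct task { double period, compute; };    /* (P_i, C_i) */

int edf_schedulable(const struct task *t, int n) {
    double u = 0.0;
    for (int i = 0; i < n; i++)
        u += t[i].compute / t[i].period;    /* utilization of task i */
    printf("total utilization = %.2f\n", u);
    return u <= 1.0;
}

int main(void) {
    struct task set[] = { {4, 1}, {5, 2}, {10, 2} };  /* made-up task set */
    printf("schedulable: %s\n", edf_schedulable(set, 3) ? "yes" : "no");
    return 0;
}
```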
5. Address Translation
5.1. Address & Address Space
Definition: Set of accessible addresses and the state associated with them.
What happens when processor reads or writes to an address?
- Perhaps acts like regular memory
- Perhaps causes I/O operation (Memory-mapped I/O)
- Causes program to abort (segfault)?
- Communicate with another program
Virtualizing Resources: Why Worry about Memory Sharing?
- The complete working state of a process and/or the kernel is defined by its data in memory (and registers)
- Consequently, different threads of control cannot be allowed to use the same memory
- Physically: two different pieces of data cannot occupy the same location in memory
- Protection: we don't want different threads to be able to access each other's memory
5.2. Important Aspects of Memory Multiplexing
Protection
- Prevent access to private memory of other processes
- Different pages of memory can be given special behavior (e.g., Read Only, Invisible to user programs, etc.)
- Kernel data protected from user programs
- Programs protected from themselves
Controlled Overlap
- Separate state of threads should not collide in physical memory. Obviously, unexpected overlap causes chaos!
- Conversely, would like the ability to overlap when desired (for communication)
Translation
- Ability to translate accesses from one address space (virtual) to a different one (physical)
- When translation exists, processor uses virtual addresses, physical memory uses physical addresses
- Side effects:
- Can be used to avoid overlap
- Can be used to give uniform view of memory to programs
Addresses can be bound to their final values anywhere along this path
- Depends on HW support
- Also depends on OS
5.3. Simple B&B Method
Multi-programming (Version w/ Protection)——Base & Bound
Can we protect programs from each other without translation?
- Yes: use two special registers BaseAddr and LimitAddr to prevent user from straying outside designated area
- Cause error if user tries to access an illegal address
- During switch, kernel loads new base/limit from PCB (Process Control Block)
- User not allowed to change base/limit registers
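A minimal sketch of the hardware-side check, with hypothetical register values (here LimitAddr holds the region size and translation is physical = base + virtual):

```c
/* Base & bound translation sketch. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

uint32_t base_reg  = 0x00100000;   /* BaseAddr: loaded from PCB on switch */
uint32_t limit_reg = 0x00010000;   /* LimitAddr: size of the region */

uint32_t translate(uint32_t vaddr) {
    if (vaddr >= limit_reg) {      /* straying outside designated area */
        fprintf(stderr, "protection fault at 0x%08x\n", vaddr);
        exit(1);
    }
    return base_reg + vaddr;       /* physical = base + virtual */
}

int main(void) {
    printf("0x1234 -> 0x%08x\n", translate(0x1234));
    translate(0x20000);            /* out of bounds: faults */
    return 0;
}
```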
Issues with the simple B&B Method
- Fragmentation problem over time
- Not every process is same size → memory becomes fragmented over time
- Missing support for sparse address space
- Would like to have multiple chunks/program (Code, Data, Stack, Heap, etc)
- Hard to do inter-process sharing
- Want to share code segments when possible
- Want to share memory between processes
- Helped by providing multiple segments per process
5.4. Multi Segment Model
- Segment map resides in processor
- Segment number mapped into base/limit pair
- Base added to offset to generate physical address
- Error check catches offset out of range
- As many chunks of physical memory as entries
- Segment addressed by portion of virtual address
- However, could be included in instruction instead:
- x86 Example: mov [es:bx],ax
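A minimal sketch of the segment-map lookup, assuming a 16-bit virtual address with a 2-bit segment number and made-up base/limit pairs:

```c
/* Multi-segment translation sketch: segment number selects a
 * base/limit pair; base + offset gives the physical address. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct seg { uint32_t base, limit; };  /* one entry per segment */

struct seg segmap[4] = {               /* e.g. code/data/stack/heap */
    {0x4000, 0x0800}, {0x4800, 0x1400}, {0xF000, 0x1000}, {0, 0},
};

uint32_t translate(uint32_t vaddr) {
    uint32_t segno  = (vaddr >> 14) & 0x3; /* top 2 bits of a 16-bit VA */
    uint32_t offset = vaddr & 0x3FFF;      /* remaining 14 bits */
    if (offset >= segmap[segno].limit) {   /* error check on offset */
        fprintf(stderr, "segfault at 0x%04x\n", vaddr);
        exit(1);
    }
    return segmap[segno].base + offset;
}

int main(void) {
    printf("0x0240 -> 0x%04x\n", translate(0x0240));  /* segment 0 */
    printf("0x4050 -> 0x%04x\n", translate(0x4050));  /* segment 1 */
    return 0;
}
```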
Observations about Segmentation
- Virtual address space has holes
- Segmentation efficient for sparse address spaces
- A correct program should never address gaps (except as mentioned in a moment)
- If it does, trap to kernel and dump core
- When is it OK to address outside the valid range?
- This is how the stack and heap are allowed to grow
- For instance, stack takes fault, system automatically increases size of stack
- Need protection mode in segment table
- For example, code segment would be read-only
- Data and stack would be read-write (stores allowed)
- Shared segment could be read-only or read-write
- What must be saved/restored on context switch?
- Segment table stored in CPU, not in memory (small)
- Might store all of process’ memory onto disk when switched (called “swapping”)
Problems with Segmentation
- Must fit variable-sized chunks into physical memory
- May move processes multiple times to fit everything
- Limited options for swapping to disk
- Fragmentation: wasted space
- External: free gaps between allocated chunks
- Internal: don’t need all memory within allocated chunks
5.5. Paging
- Physical Memory in Fixed Size Chunks
- Can use simple vector of bits to handle allocation: 00110001110001101…110010 (bitmap)
- Each bit represents a page of physical memory: 1 → allocated, 0 → free
- Typically have small pages (1K-16K)
- Consequently: need multiple pages/segment
How to implement Simple Paging
- Page Table (One per process)
- Resides in physical memory
- Contains physical page and permission for each virtual page
- Permissions include: Valid bits, Read, Write, etc
- Virtual address mapping
- Offset from virtual address copied to physical address
- Example: 10-bit offset → 1024-byte pages
- Virtual page # is all remaining bits
- Example for 32 bits: 32-10 = 22 bits, i.e. 4 million entries
- Physical page # copied from table into physical address
- Check Page Table bounds and permissions
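A minimal sketch of this one-level translation, matching the 10-bit-offset example above (the tiny page table is made up):

```c
/* Simple (one-level) paging sketch: 10-bit offset (1024-byte pages). */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define OFFSET_BITS 10
#define PAGE_SIZE   (1u << OFFSET_BITS)

struct pte { uint32_t ppn; int valid; };      /* physical page # + valid bit */

struct pte page_table[8] = {                  /* tiny table for illustration */
    {5, 1}, {9, 1}, {0, 0}, {7, 1},
};

uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> OFFSET_BITS;    /* virtual page number */
    uint32_t offset = vaddr & (PAGE_SIZE - 1); /* offset copied unchanged */
    if (vpn >= 8 || !page_table[vpn].valid) {  /* bounds + validity check */
        fprintf(stderr, "page fault at 0x%08x\n", vaddr);
        exit(1);
    }
    return (page_table[vpn].ppn << OFFSET_BITS) | offset;
}

int main(void) {
    printf("0x0C04 -> 0x%08x\n", translate(0x0C04)); /* vpn 3 -> ppn 7 */
    return 0;
}
```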
Where is page sharing used?
- The “kernel region” of every process has the same page table entries
- Different processes running same binary!
- User-level system libraries (execute only)
- Shared-memory segments between different processes
Some Simple Security Measures
- Address Space Randomization
- Kernel address space isolation
5.6. Multi-Level Translation
Motivation: a single-level page table is too large; keeping the whole table in memory is impractical.
6. Caching, TLB
6.1. Page Table & Page Table Entry
Two-level page table
- Tree Of Page Tables: 10b-10b-12b pattern
- Tables fixed size (1024 entries)
- On Context-Switch: Save single PageTablePtr Register
- Valid bits on Page Table Entries
- Don’t need every 2nd-level table
- Even when exist, 2nd-level tables can reside on disk if not in use
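A minimal sketch of the 10b-10b-12b walk, with PDE/PTE formats simplified to a frame number plus valid bit, and the directory wiring hard-coded for illustration:

```c
/* Two-level page-table walk sketch for a 10b-10b-12b split. */
#include <stdint.h>
#include <stdio.h>

struct entry { uint32_t frame; int valid; };

struct entry pd[1024];          /* page directory (top 10 bits) */
struct entry pt0[1024];         /* one second-level table (next 10 bits) */

/* Returns 1 and fills *paddr on success, 0 on fault. */
int walk(uint32_t vaddr, uint32_t *paddr) {
    uint32_t pd_idx = vaddr >> 22;             /* top 10 bits */
    uint32_t pt_idx = (vaddr >> 12) & 0x3FF;   /* middle 10 bits */
    uint32_t offset = vaddr & 0xFFF;           /* low 12 bits */

    if (!pd[pd_idx].valid) return 0;           /* 2nd-level table absent */
    /* In real hardware pd[pd_idx].frame points at the 2nd-level table;
     * here directory slot 0 is hard-wired to pt0 for illustration. */
    struct entry *pt = pt0;
    if (!pt[pt_idx].valid) return 0;           /* page absent */
    *paddr = (pt[pt_idx].frame << 12) | offset;
    return 1;
}

int main(void) {
    pd[0]  = (struct entry){0, 1};             /* directory entry 0 valid */
    pt0[5] = (struct entry){42, 1};            /* vpn 5 -> frame 42 */
    uint32_t pa;
    if (walk(5u << 12 | 0x123, &pa))
        printf("0x%08x -> 0x%08x\n", 5u << 12 | 0x123, pa);
    return 0;
}
```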
Page Table Entry(PTE)
Example: x86 PTE
- P: Present (same as “valid” bit in other architectures)
- W: Writeable
- U: User accessible
- PWT: Page write transparent: external cache write-through
- PCD: Page cache disabled (page cannot be cached)
- A: Accessed: page has been accessed recently
- D: Dirty (PTE only): page has been modified recently
- PS: Page Size: PS=1 → 4MB page (directory only).
- Bottom 22 bits of virtual address serve as offset
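A sketch of decoding those flag bits from a raw 32-bit PTE; bit positions follow the standard x86 layout listed above, and the example PTE value is made up:

```c
/* x86 PTE flag-decoding sketch. */
#include <stdint.h>
#include <stdio.h>

#define PTE_P   (1u << 0)   /* Present */
#define PTE_W   (1u << 1)   /* Writeable */
#define PTE_U   (1u << 2)   /* User accessible */
#define PTE_PWT (1u << 3)   /* Page write transparent */
#define PTE_PCD (1u << 4)   /* Page cache disabled */
#define PTE_A   (1u << 5)   /* Accessed */
#define PTE_D   (1u << 6)   /* Dirty */
#define PTE_PS  (1u << 7)   /* Page size (directory entry only) */

void decode(uint32_t pte) {
    printf("frame 0x%05x [%s%s%s%s%s]\n", pte >> 12,
           pte & PTE_P ? "P" : "-", pte & PTE_W ? "W" : "-",
           pte & PTE_U ? "U" : "-", pte & PTE_A ? "A" : "-",
           pte & PTE_D ? "D" : "-");
}

int main(void) {
    decode(0x0002A067);  /* present, writeable, user, accessed, dirty */
    return 0;
}
```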
How to use PTE?
- Demand Paging
- Keep only active pages in memory
- Place others on disk and mark their PTEs invalid
- Copy on Write
- UNIX fork gives copy of parent address space to child
- Address spaces disconnected after child created
- How to do this cheaply?
- Make copy of parent’s page tables (point at same memory)
- Mark entries in both sets of page tables as read-only
- Page fault on write creates two copies (a new page is created only when a write actually occurs)
- Zero Fill On Demand
- New data pages must carry no information (i.e., be zeroed)
- Mark PTEs as invalid; page fault on use gets zeroed page
- Often, OS creates zeroed pages in background
Multi-level Translation: Segments + Pages
What about a tree of tables?
- Lowest level page table → memory still allocated with bitmap
- Higher levels often segmented
6.2. Inverted Page Table
IA64: 64bit addresses: Six-level page table?
- No! Too slow! Too Many Almost-Empty Tables
- Core problem: physical memory may be much smaller than the virtual memory allocated
Inverted Page Table
Using a hash table
- Called an “Inverted Page Table”
- Size is independent of virtual address space
- Directly related to amount of physical memory
- Very attractive option for 64-bit address spaces
- PowerPC, UltraSPARC, IA64
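A minimal sketch of the hash lookup keyed by (pid, virtual page number); the table size, linear probing, and hash function are simplifications:

```c
/* Inverted page table sketch: sized by physical memory, not VA space. */
#include <stdint.h>
#include <stdio.h>

#define NFRAMES 256            /* entries ~ number of physical frames */

struct ipte { int used; uint32_t pid, vpn, frame; };
struct ipte table[NFRAMES];

uint32_t hash(uint32_t pid, uint32_t vpn) {
    return (pid * 31 + vpn) % NFRAMES;
}

/* Linear probing for simplicity; real designs typically use chains. */
int lookup(uint32_t pid, uint32_t vpn, uint32_t *frame) {
    for (uint32_t i = 0, h = hash(pid, vpn); i < NFRAMES; i++) {
        struct ipte *e = &table[(h + i) % NFRAMES];
        if (!e->used) return 0;                     /* miss: fault */
        if (e->pid == pid && e->vpn == vpn) {
            *frame = e->frame;                      /* hit */
            return 1;
        }
    }
    return 0;
}

int main(void) {
    table[hash(7, 100)] = (struct ipte){1, 7, 100, 33};  /* install mapping */
    uint32_t f;
    if (lookup(7, 100, &f)) printf("(pid 7, vpn 100) -> frame %u\n", f);
    return 0;
}
```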
6.3. Address Translation Comparison
Where does the MMU (Memory Management Unit) sit, and what does it do?
The MMU is a component between the processor and the memory system. When the processor issues a read of a virtual address, the request goes through the MMU to the cache (and on to memory). Some time later, the memory system replies with the data stored at the physical address obtained by translating the virtual address. This is very fast on a cache hit, but slow on a cache miss.
So what exactly is the MMU doing? On every reference (instruction fetch, load, store), the MMU reads the (possibly multi-level) page table entries to obtain the physical frame or raise a FAULT. This traversal itself goes through the cache to memory, and the access ultimately reads or writes the physical location.
In short, the MMU implements virtual-to-physical address translation: it walks the page table to find the physical address corresponding to a virtual address, using caches to speed up the lookup, with a slower fall-through to memory on a miss. It is the bridge between the processor and the memory system, and is central to efficient memory management.
6.4. TLB: Translation Look-Aside Buffer
What TLB Organization Makes Sense?
- Needs to be really fast
- Critical path of memory access
- In simplest view: before the cache
- Thus, this adds to access time (reducing cache speed)
- Seems to argue for Direct Mapped or Low Associativity
- However, needs to have very few conflicts!
- With a TLB, the miss time is extremely high! (page-table traversal)
- Cost of Conflict (Miss Time) is high
- Hit Time – dictated by clock cycle
- Thrashing: continuous conflicts between accesses
Current Example: Memory Hierarchy
Caches (all 64 B line size)
- L1 I-Cache: 32 KiB/core, 8-way set assoc.
- L1 D Cache: 32 KiB/core, 8-way set assoc., 4-5 cycles load-to-use, Writeback policy
- L2 Cache: 1 MiB/core, 16-way set assoc., Inclusive, Write-back policy, 14 cycles latency
- L3 Cache: 1.375 MiB/core, 11-way set assoc., shared across cores, Noninclusive victim cache, Write-back policy, 50-70 cycles latency
TLB
- L1 ITLB: 128 entries, 8-way set assoc., for 4 KiB pages
- 8 entries per thread, fully associative, for 2 MiB / 4 MiB pages
- L1 DTLB: 64 entries, 4-way set assoc., for 4 KiB pages
- 32 entries, 4-way set assoc., for 2 MiB / 4 MiB page translations
- 4 entries, 4-way assoc., for 1 GiB page translations
- L2 STLB: 1536 entries, 12-way set assoc., 4 KiB + 2 MiB pages
- 16 entries, 4-way set assoc., for 1 GiB page translations
What Happens on a Context Switch?
- Invalidate TLB: simple but might be expensive
- What if switching frequently between processes
- Include ProcessID in TLB
- This is an architectural solution: needs hardware
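A minimal sketch of the ProcessID (ASID) idea: a TLB entry hits only if the address-space tag also matches, so entries need not be flushed on a context switch. Sizes and fields here are made up:

```c
/* TLB-with-ASID lookup sketch: tiny, fully associative. */
#include <stdint.h>
#include <stdio.h>

struct tlbe { int valid; uint32_t asid, vpn, ppn; };
struct tlbe tlb[8];

int tlb_lookup(uint32_t asid, uint32_t vpn, uint32_t *ppn) {
    for (int i = 0; i < 8; i++)
        if (tlb[i].valid && tlb[i].asid == asid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;   /* hit only if ASID also matches */
            return 1;
        }
    return 0;                    /* miss: walk the page table */
}

int main(void) {
    tlb[0] = (struct tlbe){1, /*asid*/1, /*vpn*/5, /*ppn*/42};
    uint32_t ppn;
    printf("pid 1: %s\n", tlb_lookup(1, 5, &ppn) ? "hit" : "miss");
    printf("pid 2: %s\n", tlb_lookup(2, 5, &ppn) ? "hit" : "miss");
    return 0;
}
```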
What if translation tables change?
- For example, to move page from memory to disk or vice versa…
- Must invalidate TLB entry!
- Otherwise, might think that page is still in memory!
- Called “TLB Consistency”
7. Demand Paging
7.1. Page Fault
Definition
The Virtual-to-Physical Translation fails
- PTE marked invalid, Priv. Level Violation, Access violation, or does not exist
- Causes a fault / trap
- Not an interrupt because synchronous to instruction execution
- May occur on instruction fetch or data access
- Protection violations typically terminate the instruction
Page Fault → Demand Paging
Inversion of HW/SW Boundary
Inversion of the HW/SW boundary: when a page fault occurs during instruction execution, OS software steps in and takes the appropriate action to resolve it. This may mean loading the page, creating the page, copy-on-write, etc., and updating the PTE so that the next attempt at translation succeeds. The OS then restarts or resumes the instruction. This is a huge simplification for RISC instruction sets, but can be considerably more complex for x86 and other CISC instruction sets, where instructions may modify state as they execute.
7.2. Demand Paging
- What “block size”? 1 page (e.g., 4 KB)
- What “organization”, i.e. direct-mapped, set-assoc., fully associative?
- Any page in any frame of memory,
- i.e., fully associative: arbitrary virtual → physical mapping
- How do we locate a page?
- First check TLB, then page-table traversal
- What is page replacement policy? (i.e. LRU, Random…)
- This requires more explanation… (kinda LRU)
- What happens on a miss?
- Go to lower level to fill miss (i.e. disk)
- What happens on a write? (write-through, write back)
- Definitely write-back – need dirty bit!
Use main memory as “cache” for disk
7.3. Illusion of Infinite Memory
Transparent Level of Indirection (page table)
7.4. Demand Paging Mechanisms
PTE makes demand paging implementable
- Valid → Page in memory, PTE points at physical page
- Not Valid → Page not in memory; use info in PTE to find it on disk when necessary
What does OS do on a Page Fault
Choose an old page to replace
- If old page modified (“D=1”), write contents back to disk
- Change its PTE and any cached TLB to be invalid
- Load new page into memory from disk
- Update page table entry, invalidate TLB for new entry
- Continue thread from original faulting location
TLB for new page will be loaded when thread continued!
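A runnable toy version of these steps, against a simulated memory of 2 frames and a 4-page “disk”; the FIFO victim choice and all names are illustrative, not a real kernel's:

```c
/* Demand-paging fault handler sketch. */
#include <stdio.h>
#include <string.h>

#define NFRAMES 2
#define NPAGES  4

char disk[NPAGES][8] = {"page0", "page1", "page2", "page3"};
char frames[NFRAMES][8];
struct pte { int valid, dirty, frame; } pt[NPAGES];
int frame_owner[NFRAMES] = {-1, -1};
int next_victim = 0;                         /* FIFO replacement */

void handle_fault(int vpn) {
    int f = next_victim;                     /* choose an old page */
    next_victim = (next_victim + 1) % NFRAMES;
    int old = frame_owner[f];
    if (old >= 0) {
        if (pt[old].dirty)                   /* D=1: write back to disk */
            memcpy(disk[old], frames[f], 8);
        pt[old].valid = 0;                   /* invalidate old PTE (+TLB) */
    }
    memcpy(frames[f], disk[vpn], 8);         /* load new page from disk */
    pt[vpn] = (struct pte){1, 0, f};         /* update page table entry */
    frame_owner[f] = vpn;
}

char *access(int vpn) {
    if (!pt[vpn].valid) {                    /* page fault */
        printf("fault on page %d\n", vpn);
        handle_fault(vpn);
    }
    return frames[pt[vpn].frame];            /* continue from faulting spot */
}

int main(void) {
    printf("%s\n", access(0));   /* fault, load into frame 0 */
    printf("%s\n", access(1));   /* fault, load into frame 1 */
    printf("%s\n", access(2));   /* fault, evicts page 0 (FIFO) */
    printf("%s\n", access(0));   /* faults again: was evicted */
    return 0;
}
```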
Some Questions
During a page fault, where does the OS get a free frame?
- Keeps a free list
- Unix runs a “reaper” if memory gets too full
- Schedule dirty pages to be written back on disk
- Zero (clean) pages which haven’t been accessed in a while
- As a last resort, evict a dirty page first
- How can we organize these mechanisms?
- Work on the replacement policy
- How many page frames/process?
- Like thread scheduling, need to “schedule” memory resources:
- Utilization? fairness? priority?
- Allocation of disk paging bandwidth
7.5. Cost Model
8. General I/O
8.1. What about I/O?
- Without I/O, computers are useless (disembodied brains?)
- But… thousands of devices, each slightly different
- How can we standardize the interfaces to these devices?
- Devices unreliable: media failures and transmission errors
- How can we make them reliable?
- Devices unpredictable and/or slow
- How can we manage them if we don’t know what they will do or how they will perform?
8.2. Visualize IO
8.3. I/O Design
Operational Parameters for I/O
- Data granularity: Byte vs. Block
- Some devices provide single byte at a time (e.g., keyboard)
- Others provide whole blocks (e.g., disks, networks, etc.)
- Access pattern: Sequential vs. Random
- Some devices must be accessed sequentially (e.g., tape)
- Others can be accessed “randomly” (e.g., disk, cd, etc.)
- Fixed overhead to start transfers
- Some devices require continual monitoring
- Others generate interrupts when they need service
- Transfer Mechanism: Programmed IO and DMA
The goal of the I/O Subsystem
Provide Uniform Interfaces, Despite Wide Range of Different Devices
Standard Interfaces to Devices
- Block Devices: e.g. disk drives, tape drives, DVD-ROM
- Access blocks of data
- Commands include `open()`, `read()`, `write()`, `seek()`
- Raw I/O or file-system access
- Memory-mapped file access possible
- Character Devices: e.g. keyboards, mice, serial ports, some USB devices
- Single characters at a time
- Commands include `get()`, `put()`
- Libraries layered on top allow line editing
- Network Devices: e.g. Ethernet, Wireless, Bluetooth
- Different enough from block/character to have own interface
- Unix and Windows include socket interface
- Separates network protocol from network operation
- Includes select() functionality
- Usage: pipes, FIFOs, streams, queues, mailboxes
How Does the User Deal with Timing?
- Blocking Interface: “Wait”
- When requesting data (e.g. `read()` system call), put process to sleep until data is ready
- When writing data (e.g. `write()` system call), put process to sleep until device is ready for data
- Non-blocking Interface: “Don’t Wait”
- Returns quickly from read or write request with count of bytes successfully transferred
- Read may return nothing, write may write nothing
- Asynchronous Interface: “Tell Me Later”
- When request data, take pointer to user’s buffer, return immediately; later kernel fills buffer and notifies user
- When send data, take pointer to user’s buffer, return immediately; later kernel takes data and notifies user
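A minimal sketch of the blocking vs. non-blocking distinction using the POSIX `O_NONBLOCK` flag (reading the terminal here; any slow device would do):

```c
/* Non-blocking read sketch: with O_NONBLOCK, read() returns at once
 * (possibly with nothing) instead of putting the process to sleep. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/tty", O_RDONLY | O_NONBLOCK);
    if (fd < 0) { perror("open"); return 1; }

    char buf[64];
    ssize_t n = read(fd, buf, sizeof buf);   /* returns immediately */
    if (n >= 0)
        printf("read %zd bytes\n", n);       /* may be fewer than asked */
    else if (errno == EAGAIN)
        printf("no data ready; would have blocked\n");
    close(fd);
    return 0;
}
```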
Transferring Data to/from Controller
- Programmed I/O:
- Each byte transferred via processor in/out or load/store
- Pro: Simple hardware, easy to program
- Con: Consumes processor cycles proportional to data size
- Direct Memory Access:
- Give controller access to memory bus
- Ask it to transfer data blocks to/from memory directly
- Sample interaction with DMA controller (from OSC book):
I/O Device Notifying the OS
- The OS needs to know when:
- The I/O device has completed an operation
- The I/O operation has encountered an error
- I/O Interrupt:
- Device generates an interrupt whenever it needs service
- Pro: handles unpredictable events well
- Con: interrupts relatively high overhead
- Polling:
- OS periodically checks a device-specific status register
- I/O device puts completion information in status register
- Pro: low overhead
- Con: may waste many cycles on polling if infrequent or unpredictable I/O operations
- Actual devices combine both polling and interrupts
- For instance – High-bandwidth network adapter:
- Interrupt for first incoming packet
- Poll for following packets until hardware queues are empty
Device Drivers
- Device Driver: Device-specific code in the kernel that interacts directly with the device hardware
- Supports a standard, internal interface
- Same kernel I/O system can interact easily with different device drivers
- Special device-specific configuration supported with the ioctl() system call
- Device Drivers typically divided into two pieces:
- Top half: accessed in call path from system calls
- implements a set of standard, cross-device calls like open(), close(), read(), write(), ioctl(), strategy()
- This is the kernel’s interface to the device driver
- Top half will start I/O to device, may put thread to sleep until finished
- Bottom half: run as interrupt routine
- Gets input or transfers next block of output
- May wake sleeping threads if I/O now complete
9. File System
9.1. File
File Concept
Contiguous logical address space
- OS abstracts from the physical properties of its storage device to define a logical storage unit called file
- Persistent
- OS maps files to physical devices
Types: Data, Program, Documents
File Structure
- None - sequence of words/bytes
- Simple record structure
- Complex structures
- Can simulate last two with first method by inserting appropriate control characters
- Who decides? OS / Programs
File Attributes
- Name
- Identifier
- Type
- Location
- Size
- Protection
- Time, Date, User Identification
- Information about files is kept in the directory structure, maintained on disk
File Operations
- Create
- Write
- Read
- Reposition within file - file seek
- Delete
- Truncate
- Open/Close
9.2. Directory
Information in a Device Directory
- File name
- File type
- Address or location
- Current length
- Maximum length
- Date created, last accessed, last updated
- Owner ID, Protection Information
Operations Performed on Directory
- Search for a file
- Create a file
- Delete a file
- List a directory
- Rename a file
- Traverse the file system
Directory Organization
Goals: Efficiency, Naming, Grouping
9.3. File System
Components of a File System
- Open performs Name Resolution
- Translates pathname into a “file number”
- Used as an “index” to locate the blocks
- Creates a file descriptor in PCB within kernel
- Returns a “handle” (another integer) to user process
Directory
- Basically a hierarchical structure
- Each directory entry is a collection of files / directories (links to other entries)
- Each has a name and attributes
- Files have data
- Links (hard links) make it a DAG, not just a tree
- Softlinks (aliases) are another name for an entry
Current working directory: Per-address-space pointer to a directory (inode) used for resolving file names
File
- Named permanent storage
- Contains:
- Data
- Blocks on disk somewhere
- Metadata (Attributes)
- Owner, size, last opened, …
- Access rights
- R, W, X
- Owner, Group, Other (in Unix systems)
- Access control list in Windows system
- Data
In-Memory File System Structures
- Open system call:
- Resolves file name, finds file control block (inode)
- Makes entries in per-process and system-wide tables
- Returns index (called “file handle”) in open-file table
- Read/write system calls:
- Use file handle to locate inode
- Perform appropriate reads or writes
9.4. Our First FileSystem: FAT
FAT: File Allocation Table
- Assume (for now) we have a way to translate a path to a “file number”
- i.e., a directory structure
- Disk Storage is a collection of Blocks
- Blocks just hold file data
- File offset: `o = <B, x>` (block $B$, offset $x$ within it)
- E.g. `file_read(31, <2, x>)`:
- Index into FAT with file number
- Follow linked list to block
- Read the block from disk into memory
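A minimal sketch of that lookup: index the FAT at the file number and follow the chain; the FAT contents below are made up, with -1 marking end-of-file:

```c
/* FAT chain-traversal sketch. */
#include <stdio.h>

#define NBLOCKS 64
int fat[NBLOCKS];                  /* FAT: one entry per disk block */

/* Returns the disk block holding logical block `b` of file `filenum`,
 * or -1 if the file is shorter than that. */
int block_of(int filenum, int b) {
    int blk = filenum;             /* file number = head of chain */
    while (b-- > 0) {
        blk = fat[blk];            /* follow linked list */
        if (blk < 0) return -1;    /* past end of file */
    }
    return blk;
}

int main(void) {
    fat[31] = 40; fat[40] = 12; fat[12] = -1;  /* file 31: blocks 31,40,12 */
    printf("file 31, block 2 -> disk block %d\n", block_of(31, 2));
    return 0;
}
```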
FAT Properties
- File is collection of disk blocks
- FAT is linked list 1-1 with blocks
- File Number is index of root of block list for the file
- File offset (`o = <B, x>`): follow list to get block #
- Unused blocks → Marked free (no ordering, must scan to find)
- E.g. `file_write(31, <3, y>)`:
- Grab a free block
- Link it into the file
- Grow file by allocating free blocks and linking them in
FAT Assessment
FAT32 (32 instead of 12 bits) used in Windows, USB drives, SD…
- Where is FAT stored?
- On Disk, on boot cache in memory, second (backup) copy on disk
- What happens when you format a disk?
- Zero the blocks, Mark FAT entries “free”
- What happens when you quick format a disk?
- Mark all entries in FAT as free
- Simple
- Can implement in device firmware
FAT Directories
Directory is a file containing `<file_name: file_number>` mappings
- Free space for new/deleted entries
- In FAT: file attributes are kept in directory (!!!)
- Each directory is a linked list of entries
Where do you find root directory ( “/” )?
- At well-defined place on disk
- For FAT, this is at block 2 (there are no blocks 0 or 1)
- Remaining directories are accessed via their file_number
FAT Security Holes
- FAT has no access rights
- FAT has no header in the file blocks
- Just gives an index into the FAT to read data
9.5. Unix File System
Intro
- Original inode format appeared in BSD 4.1
- Berkeley Standard Distribution Unix
- Similar structure for Linux Ext2/3
- File Number is index into inode arrays
- Multi-level index structure
- Great for little and large files
- Asymmetric tree with fixed sized blocks
- Metadata associated with the file
- Rather than in the directory that points to it
- UNIX Fast File System (FFS) BSD 4.2 Locality Heuristics:
- Block group placement
- Reserve space
- Scalable directory structure
Inode Structure
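A minimal sketch of the multi-level index: given a file offset, decide which pointer holds the block. This assumes the common Ext2-style layout (12 direct pointers plus one single-, double-, and triple-indirect pointer, 4 KiB blocks, 4-byte block pointers); the original BSD layout differs in the exact counts:

```c
/* Inode index sketch: map a file offset to a block pointer slot. */
#include <stdint.h>
#include <stdio.h>

#define BLKSZ    4096u
#define NDIRECT  12u
#define PER_BLK  (BLKSZ / 4u)      /* pointers per indirect block: 1024 */

void locate(uint64_t offset) {
    uint64_t b = offset / BLKSZ;   /* logical block number */
    if (b < NDIRECT)
        printf("direct[%llu]\n", (unsigned long long)b);
    else if ((b -= NDIRECT) < PER_BLK)
        printf("single-indirect[%llu]\n", (unsigned long long)b);
    else if ((b -= PER_BLK) < (uint64_t)PER_BLK * PER_BLK)
        printf("double-indirect[%llu][%llu]\n",
               (unsigned long long)(b / PER_BLK),
               (unsigned long long)(b % PER_BLK));
    else
        printf("triple-indirect\n");
}

int main(void) {
    locate(1000);          /* small file: a direct pointer */
    locate(50 * 4096);     /* medium file: single indirect */
    locate(5000 * 4096);   /* large file: double indirect */
    return 0;
}
```

The asymmetry is what makes the structure "great for little and large files": small files pay only for direct pointers, while the indirect levels extend reach for big ones.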
UNIX BSD 4.2 Problem 1
When creating a file, we don't know how big it will become (in UNIX, most writes are by appending)
- How much contiguous space do you allocate for a file?
- In BSD 4.2, just find some range of free blocks
- Put each new file at the front of different range
- To expand a file, you first try successive blocks in bitmap, then choose new range of blocks
- Also in BSD 4.2: store files from same directory near each other
UNIX BSD 4.2 Problem 2
Missing blocks due to rotational delay
Issue: Read one block, do processing, and read next block. In meantime, disk has continued turning: missed next block! Need 1 revolution/block!
- Solution 1: Skip-sector positioning (“interleaving”)
- Place the blocks from one file on every other block of a track: give time for processing to overlap rotation
- Can be done by OS or in modern drives by the disk controller
- Solution 2: Read ahead: read next block right after first, even if application hasn’t asked for it yet
- This can be done either by OS (read ahead)
- By disk itself (track buffers) - many disk controllers have internal RAM that allows them to read a complete track
9.6. Links
Hard link
- Sets another directory entry to contain the file number for the file
- Creates another name (path) for the file
- Each is “first class”
Soft link (or Symbolic Link or Shortcut)
- Directory entry contains the path and name of the file
- Map one name to another name
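Both flavors exist as POSIX calls; a minimal sketch (the file names are made up):

```c
/* Hard vs. soft links: link() adds another directory entry for the same
 * inode; symlink() creates a new file whose contents are just a path. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("original", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }
    close(fd);

    /* Hard link: both names are "first class" entries for one inode. */
    if (link("original", "hardlink") < 0) perror("link");

    /* Soft link: maps one name to another name. */
    if (symlink("original", "softlink") < 0) perror("symlink");

    unlink("original");
    return 0;
}
```

After `unlink("original")`, the hard link still reaches the data (same inode), while the soft link dangles: it only stored the name.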
9.7. NTFS
- New Technology File System (NTFS)
- Default on Microsoft Windows systems
- Variable length extents
- Rather than fixed blocks
- Everything (almost) is a sequence of `attribute:value` pairs
- Metadata and data
- Mix direct and indirect freely
- Directories organized in B-tree structure by default
- Master File Table
- Database with Flexible 1KB entries for metadata/data
- Variable-sized attribute records (data or metadata)
- Extend with variable depth tree (non-resident)
- Extents – variable length contiguous regions
- Block pointers cover runs of blocks
- Similar approach in Linux (ext4)
- File create can provide hint as to size of file
- journaling for reliability
9.8. Memory Mapped Files
- Traditional I/O involves explicit transfers between buffers in process address space to/from regions of a file
- This involves multiple copies into caches in memory, plus system calls
- What if we could “map” the file directly into an empty region of our address space
- Implicitly “page it in” when we read it
- Write it and “eventually” page it out
- Executable files are treated this way when we exec the process!!
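A minimal sketch with POSIX `mmap()` (the file path is an arbitrary example):

```c
/* Memory-mapped file sketch: the file appears in the address space and
 * is paged in on first touch; no explicit read() copies are needed. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("/etc/hostname", O_RDONLY);   /* any readable file */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    fwrite(p, 1, st.st_size, stdout);  /* first access pages the file in */
    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```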
9.9. File System Caching
- Key Idea: Exploit locality by caching data in memory
- Name translations: Mapping from paths → inodes
- Disk blocks: Mapping from block address → disk content
- Buffer Cache: Memory used to cache kernel resources, including disk blocks and name translations
- Can contain “dirty” blocks (blocks not yet on disk)
- Replacement policy? LRU
- Can afford overhead of timestamps for each disk block
- Advantages:
- Works very well for name translation
- Works well in general as long as memory is big enough to accommodate a host’s working set of files.
- Disadvantages:
- Fails when some application scans through file system, thereby flushing the cache with data used only once
- E.g., `find . -exec grep foo {} \;`
- Other Replacement Policies?
- Some systems allow applications to request other policies
- E.g., ‘Use Once’:
- File system can discard blocks as soon as they are used
- Cache Size: How much memory should the OS allocate to the buffer cache vs virtual memory?
- Too much memory to the file system cache → won’t be able to run many applications at once
- Too little memory to file system cache → many applications may run slowly (disk caching not effective)
- Solution: adjust boundary dynamically so that the disk access rates for paging and file access are balanced
- Read Ahead Prefetching: fetch sequential blocks early
- Key Idea: exploit fact that most common file access is sequential by prefetching subsequent disk blocks ahead of current read request (if they are not already in memory)
- Elevator algorithm can efficiently interleave groups of prefetches from concurrent applications
- How much to prefetch?
- Too many imposes delays on requests by other applications
- Too few causes many seeks (and rotational delays) among concurrent file requests
- Delayed Writes: writes to files not immediately sent out to disk
- Instead, `write()` copies data from user-space buffer to kernel buffer (in cache)
- Enabled by presence of buffer cache: can leave written file blocks in cache for a while
- If some other application tries to read data before it is written to disk, file system will read from cache
- Flushed to disk periodically (e.g. in UNIX, every 30 sec)
- Advantages:
- Disk scheduler can efficiently order lots of requests
- Disk allocation algorithm can be run with correct size value for a file
- Some files need never get written to disk! (e.g. temporary scratch files written to `/tmp` often don't exist for 30 sec)
- Disadvantages
- What if system crashes before file has been written out?
- Worse yet, what if system crashes before a directory file has been written out? (lose pointer to inode!)
9.10. Important Ilities
- Availability: the probability that the system can accept and process requests
- Often measured in “nines” of probability. So, a 99.9% probability is considered “3-nines of availability”
- Key idea here is independence of failures
- Durability: the ability of a system to recover data despite faults
- This idea is fault tolerance applied to data
- Doesn’t necessarily imply availability: information on pyramids was very durable, but could not be accessed until discovery of Rosetta Stone
- Reliability: the ability of a system or component to perform its required functions under stated conditions for a specified period of time (IEEE definition)
- Usually stronger than simply availability: means that the system is not only “up”, but also working correctly
- Includes availability, security, fault tolerance/durability
- Must make sure data survives system crashes, disk crashes, other problems