basics
- errno:
- thread local
- never cleared until actual error
system limits
<unistd.h> // compile time
sysconf()
pathconf() fpathconf()
file
- execution bit:
- for file, become executable
- for directory, you can access its content (read allow you to ls)
chmod +X
: only apply x for folder
- hard link: create a file with the same inode number
- inode is unique on one device. Thus, hard link does not work across drives.
- also, reference counting is necessary
- symbol link: create a symbolic file that points to another file
ln -s
,-r, --relative
- bind mount: make a folder available at another position
- there are possible drawbacks when having the same folder appear twice in the system
- rbind: bind recursively
- loop mount: make a file accessible as a block device
- if the file contains an entire file system, it can then be mounted as a disk device
uid
- RUID (real)
- actual owner of a process
- EUID (effective)
- authority of a process is determined according to EUID
- SUID (set)
- When a command with SUID bit is run, its EUID become that of the owner of the file, rather than that of the user who is running it.
rws
for SUID + x,rwS
for SUID only
chmod +s
- for file: apply SUID
- for directory: the file created in this directory will have same group as this directory, instead of have user’s group
access
check if RUID has capability- by default, will follow symbolic link
- suid will not work for things which need a interpreter ex. script
ltrace
does not work because of security
sticky bit
- used for shared directories
- can create, read and execute files owned by other users
- but cannot remove files owned by other users
chmod +t
acl
# change default acl
setfacl -d -m g:basic:rwx .
# remove entry from acl
setfacl -x
# remove acl
setfacl -bn
attribute
chattr
A
: do not update atime recordS
: changes are synchronously write to diska
: only open in append modei
: cannot be modified (no renaming, no symbolic link, no execution)j
: write journal before contentt
: no tail-merging (multiple tail block can be merged to reduce disk usage)
fd
- file status flags:
O_SYNC
:- when fd
close
fail, data is not guaranteed to be written
- when fd
O_APPEND
:- cannot lseek -> gurantee that write is sequential
- file descriptor flags:
fnctl
FD_CLOEXEC
: if set, fd is closed whenexec()
- prevent fd leak to child process
- shared access:
- fd can point to same file ->
dup()
- file can be open multiple times
- lseek() + read() / write is not atomic
- pread(), pwrite()
- fd can point to same file ->
- umask: set file permission for various syscall
mode & ~umask
- can be read from
/proc/pid/status
and/proc/self/status
- additional action
fnctl
,ioctl
- stdio can be treated the same way as other fd
- we can pass
/proc/self/
/dev/fd/
- some program use
-
to differentiate read from stdin- such as
cat
- such as
- we can pass
io mode
- blocking
- non-blocking: O_NONBLOCK
- polling mode
- partial success
- EAGAIN
- multiplexing
- select: FDSET
- poll: fd array
- epoll: change fds while waiting
- signal based io: SIGIO
- check if able to read / write, do it after signal
- async:
aio_
- read / write first, check result later
- EINPROGRESS
stdio
- input / output stream
- buffer
- line: flush when receiving line end
- associated with terminal
- full: flush when buffer is full
- associated with disk file
- pipe
- none
- line: flush when receiving line end
- buffer can cause problem
- leftover buffer after fork
- unexpected buffer where no one flushes
process
the kernel starts as a dummy process 0
the first process to run is init
as process 1
SysV
systemd
fork
: duplicate the current process context- will copy the address space
exec
: replace current context with new processwait
: wait for process returnclone
: similar to fork, but share the address space
subprocess
- if a parent does not read its child process exit state (wait),
- the child become zombie after it finished execution
- if we do not want zombie
- we can fork twice so that child is orphaned instead.
- if we ignore child state change signal, the terminal state will not be saved
- SIG_IGN SIGCLD
ps aux | grep Z
- if a parent exit before its child process,
- the child is transfered to
init
(orphaned) init
will clean up periodically- orphaned process get SIGHUP and SIGCONT
- the child is transfered to
- daemon process
- let child process become its own session
session
- parent process id
ppid
- process group
pgid
getpgid()
/setpgid()
- pipelined process can be placed into one group
- session
sid
setsid()
create new session- the process become the session leader
- network hang up is sent to session leader
- a session can have one controlling terminal
- controlling terminal: some tty device
/dev/tty
: current tty device- foreground / background process group
- terminal send signal to the foreground pg
tpgid
tcgetpgrp
/tcsetpgrp
- bg can talk to
/dev/tty
- terminal send signal to the foreground pg
- job control
- when reading in background, process get SIGTTIN
- if orphaned, read gets EIO
- when writing in background, process get SIGTTOU
- when reading in background, process get SIGTTIN
signal
signal, pause, setjmp, longjmp
- signal is reset to default after trigger
- signal cannot be turned off
- system call will be interrupted
- might auto restart
- reentrant function: some func cannot be used because their use of global struct
sigaction, sigsuspend, sigsetjmp, siglongjmp,
- sigaction does not reset
- sigaction allow mask
- automatically masked unless SA_NODEFER
- mask is inherit through fork so we need to make sure that we only attach handler when necessary
- sa_flags:
- SA_RESTART, SA_INTERRUPT: default to restart
- SA_SIGINFO: more info for signal (also, enable signal queue)
- sigsuspend is atomic:
- race between signal set and pause
- jmp needs to make sure that mask is cleared
signal note
- async signal safe != thread safe
- thread is in band, thread safe can be achieved with mutex
- signal is out of band, reentrant cannot have broken global state
- handle signal in special thread
- when ignoring SIGCLD / SIGCHLD, will not generate zombie
- SIGCLD:
- if child can be waited, signal is triggered immediately
- only set signal after wait
- if child can be waited, signal is triggered immediately
- SIGCHLD:
- generate when child state actually changes?
- SIGCLD:
- SIGHUP: does not make sense for daemon process
- so they are usually used to reload config
pthread
pthread_atfork()
strerror_r()
ipc
- pipe
- fifo, named pipe
- socket
memory mapped io
mmap()
shm_open()
- create fd which is not back by actual file on disk
- on linux, equivalent to create file in
/dev/shm
memfd_create()
- similar to shm_open, but the name does not refer to actual file
hugetlbfs https://github.com/libhugetlbfs/libhugetlbfs
https://linux.die.net/man/2/get_mempolicy
https://www.kernel.org/doc/Documentation/vm/numa_memory_policy.txt
LD_DEBUG
seccomp
limit syscall
seccomp_init
seccomp_rule_add
seccomp_load
syscall tables - strace
https://fedora.juszkiewicz.com.pl/syscalls.html https://github.com/hrw/syscalls-table
reference
https://chromium.googlesource.com/chromiumos/docs/+/master/constants/syscalls.md
https://rigtorp.se/low-latency-guide/
ipc
https://blog.cloudflare.com/scalable-machine-learning-at-cloudflare/
https://github.com/cloudflare/mmap-sync
https://github.com/goldsborough/ipc-bench