Department of Computer Sciences THE UNIVERSITY OF TEXAS AT AUSTIN
CS 378 (Spring 2003)
Linux Kernel Programming
Yongguang Zhang
([email protected])
Copyright 2003, Yongguang Zhang
This Lecture
• Linux Networking
– Data structures pertaining to socket interface
– Managing the network families and protocols
– Socket and file operations
– Socket buffer sk_buff operations
– Tracing the output chain
• Questions?
Spring 2003 © 2003 Yongguang Zhang 2
Layering in the Networking Stack
Socket API
BSD Socket (struct socket)
Family-specific socket interface
INET Socket multiplexing (struct sock)
Transport protocols
TCP/UDP
IP input/output chain
IP forwarding/routing
IP-layer protocols
Socket buffer, netfilter
Common network layer queue disciplines
interface to device driver
Spring 2003 © 2003 Yongguang Zhang 3
Family, Type, and Protocol
• Socket has many “family”
– PF_INET, …
• And many types
– SOCK_STREAM, SOCK_DGRAM, SOCK_RAW
• Each family can have many protocols:
– PF_INET: IPPROTO_TCP, IPPROTO_UDP, …
• Two type of socket data structures
– struct socket: for geneirc BSD socket API
– struct sock: for family/protocol specific information
Spring 2003 © 2003 Yongguang Zhang 4
struct socket
• Kernel data structure for a open socket (for BSD
Socket API)
– Data type: struct socket (in include/linux/net.h)
• Important fields:
– socket_state state: state of the socket (connected,
unconnected, disconnecting, or connecting?)
– struct proto_ops *ops: socket API methods
– struct inode *inode, struct file *file: sockfs objects
– stuct sock *sk: network-family-specific socket object
– wait_queue_head_t wait: for blocking send/receive
– short type: type of socket (stream, dgram, or raw?)
Spring 2003 © 2003 Yongguang Zhang 5
struct proto_ops
• Methods for BSD socket API
– Fields:
• int family: network family (PF_INET, ...)
• int (*release)(struct socket *sock)
• int (*bind)(struct socket *sock, ...)
• int (*connect)(struct socket *sock, ...)
• int (*accept)(struct socket *sock, ...)
• ...
• Two sets of methods defined for INET
– See net/ipv4/af_inet.c:
• struct proto_ops inet_stream_ops = { ... };
• struct proto_ops inet_dgram_ops = { ... };
Spring 2003 © 2003 Yongguang Zhang 6
struct sock
• Network family/protocol-specific data structure
– Data type: struct sock (in include/net/sock.h)
– Currently kitchen sink for fields mostly used by TCP
– Need clean up and oganized into unions/external data
structures pertaining to family, protocol, or each layer
• Important fields
– struct proto *prot: protocol-specific socket methods
– union {...} tp_pinfo: transport-protocol-specific fields
• struct tcp_opt af_tcp;
– union {...} protinfo: network-protocol-specific fields
• struct inet_opt af_inet;
Spring 2003 © 2003 Yongguang Zhang 7
More fields of struct sock
• Many more fields (kitchen sink)
– Pointers for lists and hashes: next, prev, pprev, ...
– Addresses and ports: daddr, rcv_saddr, dport, ...
– States and flags: state, zapped, shutdown, refcnt, ...
– Destination and routing: dst_cache
– Packet queues: receive_queue, write_queue, ...
– Buffer, memory, locks: rcvbuf, lock, dst_lock, ...
– Wait queue for blocking send/receive: sleep
– Time stamp and timers: timer, stamp, rcvtimeo, ...
– A set of functions: state_change(), data_ready(), ...
– ...
Spring 2003 © 2003 Yongguang Zhang 8
struct proto
• Socket methods pertaining to specific protocol
– Fields:
• int (*close)(struct sock *sk, ...)
• int (*connect)(struct sock *sk, ...)
• int (*disconnect)(struct sock *sk, ...)
• int (*accept)(struct sock *sk, ...)
• ...
• char name[32]: name of this protocol
• Example objects:
– In net/ipv4/udp.c: struct proto udp_prot = { ... };
– In net/ipv4/tcp_ipv4.c: struct proto tcp_prot = { ... };
Spring 2003 © 2003 Yongguang Zhang 9
Two Socket Structures Illustrated
struct socket
ops
struct sock
struct proto
sk close()
struct proto_ops connect()
prot disconnect()
family accept()
release() ioctl()
bind() init()
tp_pinfo af_tcp tp_raw4
connect() destroy()
socketpair() shutdown()
accept() setsockopt()
getname() getsockopt()
poll() protinfo af_inet af_ipx sendmsg()
ioctl() recvmsg()
listen() bind()
shutdown() blacklog_rcv()
setsockopt() socket hash()
getsockopt() unhash()
sendmsg() get_port()
recvmsg() name
mmap()
sendpage()
Spring 2003 © 2003 Yongguang Zhang 10
Keeping Track of Families/Protocols
• Kernel data structure for network family
– struct net_proto_family
• Two types of kernel data structure for protocol
– For socket operations (multiplexing): struct
inet_protosw
– For receive operation (demultiplexing): struct
inet_protocol
• Kernel keep three lists for above data structures
– One for each type
– A protocol implementation must register to the lists
Spring 2003 © 2003 Yongguang Zhang 11
Network Protocol Family
• Kernel data structure for a network family
– Data type: struct net_proto_family
– Each family must define this structure
• Kernel keeps a list of network protocol families
– As an array of pointers to struct net_proto_family
– Array name: net_families[] (in net/socket.c)
– Each family must register by sock_register(ops)
– Registration function sock_register(ops) essentially
net_families[ops->family]=ops;
Spring 2003 © 2003 Yongguang Zhang 12
INET Family Example
• In net/ipv4/af_net.c:
struct net_proto_family inet_family_ops = {
family: PF_INET,
create: inet_create
};
...
static int __init inet_init(void)
{
...
(void) sock_register(&inet_family_ops);
...
}
Spring 2003 © 2003 Yongguang Zhang 13
INET Protocols for Socket API
• Data structure for protocols supported by socket
– struct inet_protosw (include/net/protocol.h)
• Fields
– type: stream, dgram, or raw?
– protocol: transport protocol number
– struct proto *prot: transport-specific methods
– struct proto_ops *ops: general BSD socket methods
• Kernel keeps a list of objects of this type
– As an array inetsw[] (in net/ipv4/af_inet.c)
– To add to list: call inet_register_protosw(p)
Spring 2003 © 2003 Yongguang Zhang 14
INET Protocols for Demultiplexing
• Data structure for protocols used in input chain
– struct inet_protocol (include/net/protocol.h)
• Major fields
– handler(), err_handler(): incoming packet handlers
– protocol: transport protocol number
• Kernel keeps a list of objects of this type
– Variable inet_protos[] (in net/ipv4/protocol.c)
– To add to list: call inet_add_protocol(prot)
• May not have one-to-one mapping with the list of
protocols supported by socket API
Spring 2003 © 2003 Yongguang Zhang 15
Builtin Protocols in 2.4 Kernel
• Transport protocols supported by socket API:
– TCP (stream type) and UDP (dgram type)
– RAW (not really a protocol, but a way to bypass the
transport layer)
– Look for inetsw_array[] in net/ipv4/af_inet.c:
• Demultiplexing protocols included:
– 4 built-in: TCP, UDP, ICMP, IGMP
– 3 loaded by modules: IPIP, GRE, PIM
– Look for igmp_protocol, tcp_protocol,
udp_protocol, and icmp_protocol in
net/ipv4/protocol.c
Spring 2003 © 2003 Yongguang Zhang 16
Families and Protocols
net_families[] inetsw[] inet_stream_ops inet_dgram_ops
family create SOCK_STREAM PF_INET, PF_INET,
IPPROTO_TCP
inet_release inet_release
&tcp_prot
PF_INET inet_create &inet_stream_ops inet_bind inet_bind
... inet_stream_connect inet_dgram_connect
sock_no_socketpair sock_no_socketpair
SOCK_DGRAM inet_accept sock_no_accept
inet_protos[] IPPROTO_UDP … …
&udp_prot
&inet_dgram_ops
...
tcp_rcv SOCK_RAW
IPPROTO_TCP IPPROTO_IP tcp_prot udp_prot
... &raw_prot
tcp_close udp_close
&inet_dgram_ops
... tcp_v4_connect udp_connect
tcp_disconnect udp_disconnect
udp_rcv tcp_accept NULL
IPPROTO_UDP tcp_ioctl udp_ioctl
... tcp_v4_init_sock NULL
… …
"TCP" "UDP"
Spring 2003 © 2003 Yongguang Zhang 17
Networking Stack Initialization
• Function void sock_init(void) in net/socket.c
– Called by do_basic_setup() in init/main.c
– Initalize the data structures and slab caches
– Register the sockfs file system (will explain later)
• Each network family initializes when loading
– Example: init function int __init inet_init(void) in
inet/ipv4/af_net.c (for INET module)
– Register the INET family (explained before)
– Add all built-in protocols (to both lists)
– Initialize other modules: arp, ip, tcp, icmp, ...
Spring 2003 © 2003 Yongguang Zhang 18
Socket and Inode
• An open socket is an open file in sockfs filesystem
• One-to-one mapping between socket and inode
– In fact, socket object is embedded in the inode object:
struct inode {
...
union {
struct minix_inode_info minix_i;
struct ext2_inode_info ext2_i;
...
struct socket socket_i;
...
} u;
};
Spring 2003 © 2003 Yongguang Zhang 19
Socket Filesystem
• Implementation in net/socket.c
– Every data structures and functions for a file system
struct super_operations sockfs_ops = ...
struct dentry_operations sockfs_dentry_operations = ...
struct file_operations socket_file_ops = {
llseek: no_llseek,
read: sock_read,
write: sock_write,
... };
DECLARE_FSTYPE(sock_fs_type, “sockfs”,
sockfs_read_super, FS_NOMOUNT);
– Register this file system in sock_init():
register_filesystem(&sock_fs_type);
sock_mnt = kern_mount(&sock_fs_type);
Spring 2003 © 2003 Yongguang Zhang 20
Operations on a sockfs fd
• sock_alloc(): to allocate a struct socket object
– Allocate an inode object instead
• sock_map_fd(): to allocate an open file object for
the socket
– Create and setup open file and dentry (including the
links and methods)
• File operations on the task's fd: deligated to the
corresponding socket operations
– sock_read() calls sock_recvmsg()
– sock_write() calls sock_sendmsg()
Spring 2003 © 2003 Yongguang Zhang 21
Allocate a struct socket
• Allocate an inode object instead, and establish the
mapping
– struct socket *sock_alloc(void):
inode = get_empty_inode();
...
inode->i_sb = sock_mnt->mnt_sb;
sock = socki_lookup(inode);
...
sock->inode = inode;
...
return sock;
– struct socket *socki_lookup(struct inode *inode):
return &inode->u.socket_i;
Spring 2003 © 2003 Yongguang Zhang 22
Allocate fd for a Socket
• Create and setup open file and dentry objects
– int sock_map_fd(struct socket *sock):
fd = get_unused_fd();
if (fd >= 0) {
...
file->f_dentry = d_alloc( ... );
file->f_dentry->d_op = &socket_dentry_operations;
d_add(file->f_dentry, sock->inode);
...
sock->file = file;
file->f_op = sock->inode->i_fop = &socket_file_ops;
...
fd_install(fd, file);
}
return fd;
Spring 2003 © 2003 Yongguang Zhang 23
File Operations on a Socket fd
• Translate into the corresponding socket operations
– According to the socket_file_ops method table
– Exmple: ssize_t sock_read(file, ubuf, size, ppos)
sock = socki_lookup(file->f_dentry->d_inode);
...
msg.... = ...
...
return sock_recvmsg(sock, &msg, size, flags);
Spring 2003 © 2003 Yongguang Zhang 24
How is a Socket Created?
• System call service routine sys_socket()
– First call sock_create(), which does
...
sock = sock_alloc()
sock->type = type;
net_families[family]->create(sock, protocol)
…
– Then call sock_map_fd(), to allocate open file and fd
• In the INET family, create is inet_create()
– Essentially, allocate a struct sock object and set up
socket methods for the requested type/protocol pair:
sock->ops = … ->ops; sk->prot = … ->prot;
Spring 2003 © 2003 Yongguang Zhang 25
A UDP Socket Example
inet_dgram_ops
task->files->fd[s] struct inode PF_INET,
struct sock inet_release
inet_bind
inet_dgram_connect
sock_no_socketpair
struct file sock_no_accept
prot …
f_dentry
tp_pinfo udp_prot
struct udp_close
socket udp_connect
struct dentry protinfo af_inet udp_disconnect
ops NULL
inode udp_ioctl
d_inode file
sk NULL
socket …
"UDP"
Spring 2003 © 2003 Yongguang Zhang 26
Socket Buffer
• Very important data structure for a packet
– Carry packet payload through the input/output chains
– Minimize copying
– Memory allocated when the packet is first created (by
socket or by network device driver) and released when
it is done (by socket or network device driver)
• Data type: struct sk_buff
– Defined in include/linux/skbuff.h
Spring 2003 © 2003 Yongguang Zhang 27
sk_buff Fields
• Pointers used to form various lists: next, prev, list
• Which socket it belongs to: sk
• Keeping track of time: stamp
• Incoming or outgoing network device: dev
• Pointer to transport layer header: union { ... } h
• Pointer to network layer header:union { ... } nh
• Pointer to link layer headers: union { ... } mac
• Information of this packet: len, csum, ...
• Pointers to packet data: head, data, tail, end
Spring 2003 © 2003 Yongguang Zhang 28
Buffer Handling
• head, end: beginning and end of allocated space
• data, tail: beginning and end of the packet area
• Space between head and data is for headers
– Can shrink and grow as headers are added or removed
struct sk_buff
Memory allocated for this packet
Packet
head
data
tail
end
Spring 2003 © 2003 Yongguang Zhang 29
sk_buff Operations
• In include/linux/skbuff.h and net/core/socket.c
• struct sk_buff *alloc_skb(size, gfp_mask)
– Create a sk_buff object and allocate memory for size
skb->data = skb->tail = skb->head
skb->end = skb->head + size
skb->len = 0
• struct sk_buff *kree_skb(skb)
– Free the sk_buff and deallocate packet memory
• void skb_reserve(skb, len)
– Reserve headroom space before storing packet
skb->data += len; skb->tail +=len;
Spring 2003 © 2003 Yongguang Zhang 30
More sk_buff Operations
• unsigned char *skb_put(skb, len)
– Allow appending len bytes to tail, return previous tail
skb->tail += len; skb->len += len;
• unsigned char *skb_push(skb, len)
– Allow adding len bytes before data, return new head
skb->data -= len; skb->len += len;
• unsigned char *skb_pull(skb, len)
– Remove len bytes from data, return new head
skb->data += len; skb->len -= len;
• Be carefull – kernel panic if data or tail moved
out of bound
Spring 2003 © 2003 Yongguang Zhang 31
Protocol Headers
• Transport header: union h in sk_buff:
union {
struct tcphdr *th;
struct udphdr *uh;
struct icmphdr *icmph;
...
} h;
• Network header: union nh in sk_buff:
union {
struct iphdr *iph;
struct ipv6hdr *ipv6h;
struct arphdr *arph;
...
} nh;
Spring 2003 © 2003 Yongguang Zhang 32
Getting to the Protocol Headers
• Through the header pointers and header types
– Example: UDP header type in include/linux/udp.h:
struct udphdr {
__u16 source;
__u16 dest;
__u16 len;
__u16 check;
};
• Examples: get to all header fields from sk_buff:
– UDP dest port number: skb->h.uh->dest
– Dest IP address: skb->nh.iph->daddr
Spring 2003 © 2003 Yongguang Zhang 33
Tracing the Output Chain
• User-mode process sends a message with UDP
– sendto(fd, buff, len, flags, addr, addr_len)
• System call service routine sys_sendto()
– Look up struct socket from fd and call the socket
operation to send message:
• sock = sockfd_lookup(fd,&err)
• sock_sendmsg(sock, &msg, len)
– Here sockfd_lookup(fd,err) is essentially
• sock = socki_lookup( fget(fd) ->f_dentry->d_inode )
– And sock_sendmsg(sock,msg,size) is essentially
• sock->ops->sendmsg(sock, msg, size, &scm)
Spring 2003 © 2003 Yongguang Zhang 34
Multiplexing in Transport Layer
• Socket method under INET family and datagram
type (referred in inet_dgram_ops)
– sock->ops->sendmsg → inet_sendmsg()
• inet_sendmsg()
– Deligate to the protocol's sendmsg method:
• sk = sock->sk;
• return sk->prot->sendmsg(sk, msg, size);
• Under UDP protocol (referred in udp_prot):
– sk->prot->sendmsg → udp_sendmsg()
Spring 2003 © 2003 Yongguang Zhang 35
Protocol Layer
• udp_sendmsg() in net/ipv4/udp.c
– Verify the addresses
– Find a route for this packet (call ip_route_output())
– Build the UDP header, etc.
– Call ip_build_xmit() to build and transmit the packet
• Why do we need a route (rt) here?
– Cached in struct sock (in case of connected socket)
– Pass down as an argument to ip_build_xmit()
• tcp_sendmsg() in net/ipv4/tcp.c
– Much more complicated
Spring 2003 © 2003 Yongguang Zhang 36
IP Output: ip_build_xmit()
• Source code in net/ipv4/ip_output.c
• Check for slow path
– Use ip_build_xmit_slow() if need fragmentation
– Need to know path MTU, that's why we need rt early
• Allocate socket buffer (sk_buff)
skb = sock_alloc_send_skb(...)
– Which eventually calls alloc_skb()
• Reverse headroom for link layer header
skb_reserve(skb, hh_len);
– Again, we need to know which type of device from rt
Spring 2003 © 2003 Yongguang Zhang 37
ip_build_xmit() continued
• Fill the packet in sk_buff for the packet
– Expand skb payload to include IP header and message
iph = (struct iphdr *)skb_put(skb, length)
– Build the IP header (we need rt to know source addr)
– Get the fragment from user space (by udp_getfraq()
function, passed as getfrag from udp_sendmsg())
getfrag(fraq, (char *)iph)+iph->ihl*4, 0, length-iph->ihl*4);
• Pass through the netfilter chain
– Netfilter: will cover later
– Eventually reach the network device driver
Spring 2003 © 2003 Yongguang Zhang 38
Summary
• Linux Networking
– LKP: §8
– ULK: §18
• Next lecture: Linux Networking
Spring 2003 © 2003 Yongguang Zhang 39