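Let's first look at how the kernel handles the half-open (SYN) queue and the accept queue during the three-way handshake, which is really the server-side state transitions of the handshake. How the kernel represents these two queues was covered in the previous post, so I won't repeat it here.
When a layer-3 packet arrives, the kernel calls the layer-4 protocol handler, which for TCP is tcp_v4_rcv. How that call is made is described in an earlier post on this blog.
tcp_v4_rcv in turn ends up calling tcp_v4_do_rcv to process the incoming packet. Before looking at tcp_v4_do_rcv, let's see how tcp_v4_rcv uses the 4-tuple (source and destination address and port) to find the matching sock object.
Before the analysis, keep in mind that when the three-way handshake of a TCP connection completes, the kernel creates a new socket, most of whose fields are copied from the listening (parent) socket. The new socket's state is set to established, while the parent socket stays in the listen state.
From the earlier posts we also know that inet_hashinfo keeps listening sockets apart from sockets in the states between TCP_ESTABLISHED and TCP_CLOSE: the former live in listening_hash, the latter in ehash. Looking up a socket by its 4-tuple is therefore done separately in these two hash tables.
The kernel finds the socket by calling __inet_lookup: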
```c
/// Snippet from tcp_v4_rcv.
sk = __inet_lookup(net, &tcp_hashinfo, iph->saddr,
                   th->source, iph->daddr, th->dest, inet_iif(skb));

static inline struct sock *__inet_lookup(struct net *net,
                                         struct inet_hashinfo *hashinfo,
                                         const __be32 saddr, const __be16 sport,
                                         const __be32 daddr, const __be16 dport,
                                         const int dif)
{
    u16 hnum = ntohs(dport);
    struct sock *sk = __inet_lookup_established(net, hashinfo,
                                                saddr, sport, daddr, hnum, dif);

    return sk ? : __inet_lookup_listener(net, hashinfo, daddr, hnum, dif);
}
```
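tcp_hashinfo, which we analyzed earlier, holds all the hash tables TCP uses (sockets, ports and so on). The lookup here searches tcp_hashinfo, more precisely its ehash or listening_hash member, for the matching socket.
The kernel performs up to two lookups. It first searches the established sockets: for these the three-way handshake is already complete, so the socket can be found in ehash with a plain hash of the 4-tuple.
If __inet_lookup_established finds nothing, the search continues in __inet_lookup_listener, that is, among the sockets in the listening state. There the bucket is selected by the local port (hnum), and candidates are then matched mainly on daddr, the destination address.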
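To make the two-stage lookup concrete, here is a minimal user-space sketch in C. It is not kernel code: struct conn, struct listener, the lookup function and the linear scans (used here instead of the ehash and listening_hash tables) are all assumptions made for illustration. It only shows the order of the two searches: exact 4-tuple first, listener second.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins; none of these are kernel types. */
struct conn { uint32_t saddr, daddr; uint16_t sport, dport; };
struct listener { uint32_t addr; uint16_t port; };   /* addr 0 = any address */

enum hit { HIT_NONE, HIT_ESTABLISHED, HIT_LISTENER };

/* First try an exact 4-tuple match, then fall back to the listeners. */
static enum hit lookup(const struct conn *est, size_t n_est,
                       const struct listener *lst, size_t n_lst,
                       uint32_t saddr, uint16_t sport,
                       uint32_t daddr, uint16_t dport)
{
    size_t i;

    assert(est != NULL || n_est == 0);
    assert(lst != NULL || n_lst == 0);

    /* Stage 1: established connections, matched on the full 4-tuple. */
    for (i = 0; i < n_est; i++)
        if (est[i].saddr == saddr && est[i].sport == sport &&
            est[i].daddr == daddr && est[i].dport == dport)
            return HIT_ESTABLISHED;

    /* Stage 2: listeners, matched on local port and (optionally) address. */
    for (i = 0; i < n_lst; i++)
        if (lst[i].port == dport &&
            (lst[i].addr == 0 || lst[i].addr == daddr))
            return HIT_LISTENER;

    return HIT_NONE;
}
```

With one established connection and one wildcard listener on port 80, a packet from the known peer hits the established entry, while a packet from any other source to port 80 falls back to the listener.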
Once the socket has been found, packet processing starts, that is, we enter tcp_v4_do_rcv:
```c
int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
{
    struct sock *rsk;
    ..................................................

    /// TCP_ESTABLISHED: take the fast path.
    if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
        TCP_CHECK_TIMER(sk);
        if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len)) {
            rsk = sk;
            goto reset;
        }
        TCP_CHECK_TIMER(sk);
        return 0;
    }

    /// Validate the header length and the checksum.
    if (skb->len < tcp_hdrlen(skb) || tcp_checksum_complete(skb))
        goto csum_err;
    /// Handle the TCP_LISTEN state.
    if (sk->sk_state == TCP_LISTEN) {
        struct sock *nsk = tcp_v4_hnd_req(sk, skb);
        if (!nsk)
            goto discard;

        if (nsk != sk) {
            if (tcp_child_process(sk, nsk, skb)) {
                rsk = nsk;
                goto reset;
            }
            return 0;
        }
    }

    TCP_CHECK_TIMER(sk);
    /// Handle all other states (everything except ESTABLISHED and TIME_WAIT).
    if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb), skb->len)) {
        rsk = sk;
        goto reset;
    }
    TCP_CHECK_TIMER(sk);
    return 0;
    ......................................................................
}
```
As you can see, once inside, the function branches on the socket state. There are really three cases: TCP_ESTABLISHED, TCP_LISTEN, and all remaining states.
We skip TCP_ESTABLISHED for now.
Let's look at what the kernel does when the first SYN segment arrives. It first enters tcp_v4_hnd_req, which we analyze later; for now it is enough to know that for the first SYN segment it returns the current socket. So nsk == sk, and we fall through to tcp_rcv_state_process, which handles every state except ESTABLISHED and TIME_WAIT.
Here we only look at its handling of the listen state; the other states are covered one by one as we run into them:
```c
int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
                          struct tcphdr *th, unsigned len)
{
    struct tcp_sock *tp = tcp_sk(sk);
    /// Get the corresponding inet_connection_sock.
    struct inet_connection_sock *icsk = inet_csk(sk);
    int queued = 0;
    tp->rx_opt.saw_tstamp = 0;

    switch (sk->sk_state) {
    case TCP_LISTEN:
        /// An ACK segment: return 1, which makes the caller send a RST to the peer.
        if (th->ack)
            return 1;
        /// A RST: ignore the segment.
        if (th->rst)
            goto discard;
        /// A SYN: call the virtual function conn_request, which for TCP over IPv4
        /// is initialized to tcp_v4_conn_request.
        if (th->syn) {
            if (icsk->icsk_af_ops->conn_request(sk, skb) < 0)
                return 1;
            kfree_skb(skb);
            return 0;
        }
        goto discard;
    ............................................................
}
```
As shown, the SYN segment is ultimately handled by tcp_v4_conn_request, whose implementation we look at next.
First, a few helper functions. The first is reqsk_queue_is_full, which tells whether the half-open queue is full. The implementation is simple: it shifts qlen right by max_qlen_log, so the result is non-zero exactly when qlen has reached 2^max_qlen_log:
```c
static inline int reqsk_queue_is_full(const struct request_sock_queue *queue)
{
    return queue->listen_opt->qlen >> queue->listen_opt->max_qlen_log;
}
```
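The shift is easier to follow with numbers. The sketch below repeats the same expression on plain integers; syn_queue_is_full and its parameters are illustrative names of my own, not kernel API.

```c
#include <assert.h>

/* Same test as the kernel helper, on plain integers (illustrative names). */
static int syn_queue_is_full(int qlen, unsigned int max_qlen_log)
{
    assert(qlen >= 0 && max_qlen_log < 31);
    /* Non-zero as soon as qlen >= (1 << max_qlen_log). */
    return qlen >> max_qlen_log;
}
```

With max_qlen_log = 8 the queue counts as full from 256 entries on: 255 >> 8 is 0, while 256 >> 8 is 1.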
The second is sk_acceptq_is_full, which tells whether the accept queue is full. This one is also simple: it compares the current queue length sk_ack_backlog with the maximum queue length sk_max_ack_backlog.
```c
static inline int sk_acceptq_is_full(struct sock *sk)
{
    return sk->sk_ack_backlog > sk->sk_max_ack_backlog;
}
```
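One consequence of the strict > comparison is worth spelling out: the check only reports "full" once the counter exceeds the limit, so a limit of N leaves room for N + 1 queued connections. The sketch below counts this on plain integers; acceptq_is_full and acceptq_capacity are hypothetical helper names used only for this illustration.

```c
#include <assert.h>

/* Same comparison as sk_acceptq_is_full, on plain integers. */
static int acceptq_is_full(unsigned int ack_backlog, unsigned int max_ack_backlog)
{
    return ack_backlog > max_ack_backlog;
}

/* Count how many connections can be queued before the check reports "full". */
static unsigned int acceptq_capacity(unsigned int max_ack_backlog)
{
    unsigned int queued = 0;

    assert(max_ack_backlog < 1000000u);   /* keep the loop short */
    while (!acceptq_is_full(queued, max_ack_backlog))
        queued++;
    return queued;
}
```

For a limit of 5, the check is still false at 5 queued connections and only becomes true at 6.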
The last one is tcp_openreq_init, which initializes the fields of a request_sock and of the inet_request_sock that wraps it. Each time a SYN arrives, a new request_sock is allocated, initialized and added to the half-open queue.
```c
static inline void tcp_openreq_init(struct request_sock *req,
                                    struct tcp_options_received *rx_opt,
                                    struct sk_buff *skb)
{
    struct inet_request_sock *ireq = inet_rsk(req);

    req->rcv_wnd = 0; /* So that tcp_send_synack() knows! */
    req->cookie_ts = 0;
    tcp_rsk(req)->rcv_isn = TCP_SKB_CB(skb)->seq;
    req->mss = rx_opt->mss_clamp;
    req->ts_recent = rx_opt->saw_tstamp ? rx_opt->rcv_tsval : 0;
    ireq->tstamp_ok = rx_opt->tstamp_ok;
    ireq->sack_ok = rx_opt->sack_ok;
    ireq->snd_wscale = rx_opt->snd_wscale;
    ireq->wscale_ok = rx_opt->wscale_ok;
    ireq->acked = 0;
    ireq->ecn_ok = 0;
    ireq->rmt_port = tcp_hdr(skb)->source;
}
```
Now for the implementation of tcp_v4_conn_request:
```c
int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
{
    struct inet_request_sock *ireq;
    struct tcp_options_received tmp_opt;
    struct request_sock *req;
    __be32 saddr = ip_hdr(skb)->saddr;
    __be32 daddr = ip_hdr(skb)->daddr;
    /// The field name is misleading. `when` normally holds the transmit
    /// timestamp used for RTT estimation; here it is non-zero only when the
    /// SYN was handed over from a TIME_WAIT socket, and then it carries the ISN.
    __u32 isn = TCP_SKB_CB(skb)->when;
    struct dst_entry *dst = NULL;
#ifdef CONFIG_SYN_COOKIES
    int want_cookie = 0;
#else
#define want_cookie 0 /* Argh, why doesn't gcc optimize this :( */
#endif

    /// Drop the packet if it was sent to a broadcast or multicast address.
    if (skb->rtable->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
        goto drop;

    /// Check whether the half-open queue is full. If it is full and the SYN did
    /// not come from a TIME_WAIT socket (isn == 0), drop the packet, unless SYN
    /// cookies are enabled: then processing continues, because a SYN cookie does
    /// not need an entry in the half-open queue.
    if (inet_csk_reqsk_queue_is_full(sk) && !isn) {
#ifdef CONFIG_SYN_COOKIES
        if (sysctl_tcp_syncookies) {
            want_cookie = 1;
        } else
#endif
        goto drop;
    }
    /// If the accept queue is full and qlen_young is greater than 1, drop the
    /// packet. qlen_young > 1 means the SYN queue already holds enough fresh
    /// requests (retransmitted ones are not counted).
    if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1)
        goto drop;
    req = inet_reqsk_alloc(&tcp_request_sock_ops);
    if (!req)
        goto drop;
    ...................................................

    /// Initialize tmp_opt. tcp_options_received holds TCP option information
    /// (MSS, window scale factor and so on).
    tcp_clear_options(&tmp_opt);
    tmp_opt.mss_clamp = 536;
    tmp_opt.user_mss = tcp_sk(sk)->rx_opt.user_mss;

    /// Parse the TCP options carried by the peer's SYN and fill in tmp_opt.
    tcp_parse_options(skb, &tmp_opt, 0);

    .......................................................
    /// Initialize the new req.

    tcp_openreq_init(req, &tmp_opt, skb);
    ...............................................

    /// Save the IP options of the incoming packet in req.
    ireq->opt = tcp_v4_save_options(sk, skb);
    if (!want_cookie)
        TCP_ECN_create_request(req, tcp_hdr(skb));

    if (want_cookie) {
#ifdef CONFIG_SYN_COOKIES
        syn_flood_warning(skb);
        req->cookie_ts = tmp_opt.tstamp_ok;
#endif
        isn = cookie_v4_init_sequence(sk, skb, &req->mss);
    } else if (!isn) {
        .............................................
        /// Compute a suitable ISN.
        isn = tcp_v4_init_sequence(skb);
    }

    /// Record the ISN that we send to the peer.
    tcp_rsk(req)->snt_isn = isn;

    /// Send the SYN-ACK. (With want_cookie set, req is not linked into the
    /// half-open queue.)
    if (__tcp_v4_send_synack(sk, req, dst) || want_cookie)
        goto drop_and_free;

    /// Link req into the half-open queue.
    inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
    return 0;

drop_and_release:
    dst_release(dst);
drop_and_free:
    reqsk_free(req);
drop:
    return 0;
}
```
tcp_v4_hnd_req mainly checks whether the half-open queue already holds a request_sock matching the incoming packet. If it does, the packet may be the final ACK of the handshake, so a series of validity checks follows (retransmission, RST, SYN and so on). Once the packet is confirmed to be that ACK, the virtual function syn_recv_sock is called to create the new socket.
```c
static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
{
    struct tcphdr *th = tcp_hdr(skb);
    const struct iphdr *iph = ip_hdr(skb);
    struct sock *nsk;
    struct request_sock **prev;
    /// Look up the request_sock for this packet in the listening socket.
    struct request_sock *req = inet_csk_search_req(sk, &prev, th->source,
                                                   iph->saddr, iph->daddr);
    if (req)
        /// Found: continue with the request processing.
        return tcp_check_req(sk, skb, req, prev);

    /// Not found: try inet_lookup_established. This is needed because the state
    /// may have changed before we got here, i.e. a socket for this connection
    /// that is no longer in the listen state may already exist.

    nsk = inet_lookup_established(sock_net(sk), &tcp_hashinfo, iph->saddr,
                                  th->source, iph->daddr, th->dest, inet_iif(skb));

    if (nsk) {
        if (nsk->sk_state != TCP_TIME_WAIT) {
            /// Not TIME_WAIT: return the socket that was found.
            bh_lock_sock(nsk);
            return nsk;
        }
        /// TIME_WAIT: return NULL.
        inet_twsk_put(inet_twsk(nsk));
        return NULL;
    }

#ifdef CONFIG_SYN_COOKIES
    if (!th->rst && !th->syn && th->ack)
        sk = cookie_v4_check(sk, skb, &(IPCB(skb)->opt));
#endif
    return sk;
}
```
The main job of tcp_check_req is to call that virtual function, create the new socket and return it.
Again, a few helper functions first. The first is inet_csk_reqsk_queue_unlink, which unlinks an entry from the half-open queue:
```c
static inline void inet_csk_reqsk_queue_unlink(struct sock *sk,
                                               struct request_sock *req,
                                               struct request_sock **prev)
{
    reqsk_queue_unlink(&inet_csk(sk)->icsk_accept_queue, req, prev);
}

static inline void reqsk_queue_unlink(struct request_sock_queue *queue,
                                      struct request_sock *req,
                                      struct request_sock **prev_req)
{
    write_lock(&queue->syn_wait_lock);
    /// Unlink req: make the link that pointed at it point at its successor.
    *prev_req = req->dl_next;
    write_unlock(&queue->syn_wait_lock);
}
```
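The single assignment `*prev_req = req->dl_next` works because prev is a pointer to the link that refers to req, which is what the search step hands back. Below is a minimal user-space sketch of that pattern; struct node, find and unlink_node are illustrative stand-ins rather than kernel code, and locking is omitted.

```c
#include <assert.h>
#include <stddef.h>

struct node { int id; struct node *next; };   /* stand-in for request_sock */

/* Find `id`; on success *prevp points at the link that refers to the node. */
static struct node *find(struct node **head, int id, struct node ***prevp)
{
    struct node **prev = head;
    struct node *n;

    assert(head != NULL && prevp != NULL);
    for (n = *head; n != NULL; prev = &n->next, n = n->next) {
        if (n->id == id) {
            *prevp = prev;
            return n;
        }
    }
    return NULL;
}

/* Same idea as reqsk_queue_unlink: redirect the referring link past the node. */
static void unlink_node(struct node *n, struct node **prev)
{
    *prev = n->next;
}
```

Because prev may point either at the list head or at some node's next field, removing the first element and removing a middle element need no special cases.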
The second is inet_csk_reqsk_queue_removed, which updates the qlen and qlen_young counters.
```c
static inline void inet_csk_reqsk_queue_removed(struct sock *sk,
                                                struct request_sock *req)
{
    if (reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req) == 0)
        inet_csk_delete_keepalive_timer(sk);
}

static inline int reqsk_queue_removed(struct request_sock_queue *queue,
                                      struct request_sock *req)
{
    struct listen_sock *lopt = queue->listen_opt;
    /// retrans == 0 means the request was never retransmitted, so
    /// qlen_young is decremented as well.
    if (req->retrans == 0)
        --lopt->qlen_young;

    return --lopt->qlen;
}
```
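The bookkeeping of the two counters can be sketched in user space as follows. Only queue_removed mirrors the code above. The other two helpers, queue_added and note_retransmit, are my assumptions about the matching "add" and "first retransmission" steps, which are not shown in this post; all names are hypothetical.

```c
#include <assert.h>

struct req { int retrans; };                  /* stand-in for request_sock */
struct counters { int qlen; int qlen_young; };

/* Assumption: a newly queued request counts as young. */
static void queue_added(struct counters *c)
{
    c->qlen++;
    c->qlen_young++;
}

/* Assumption: the first retransmission makes a request "old". */
static void note_retransmit(struct counters *c, struct req *r)
{
    if (r->retrans++ == 0)
        c->qlen_young--;
}

/* Mirrors reqsk_queue_removed: returns the remaining queue length. */
static int queue_removed(struct counters *c, const struct req *r)
{
    assert(c->qlen > 0);
    if (r->retrans == 0)
        --c->qlen_young;
    return --c->qlen;
}
```

A return value of 0 from queue_removed corresponds to the case where the kernel code above deletes the keepalive timer, since the half-open queue is now empty.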
The last one is inet_csk_reqsk_queue_add, which appends the req to the accept queue.
```c
static inline void inet_csk_reqsk_queue_add(struct sock *sk,
                                            struct request_sock *req,
                                            struct sock *child)
{
    reqsk_queue_add(&inet_csk(sk)->icsk_accept_queue, req, sk, child);
}


static inline void reqsk_queue_add(struct request_sock_queue *queue,
                                   struct request_sock *req,
                                   struct sock *parent,
                                   struct sock *child)
{
    req->sk = child;
    sk_acceptq_added(parent);
    /// The accept queue is exactly the list kept in rskq_accept_head and
    /// rskq_accept_tail of request_sock_queue.
    if (queue->rskq_accept_head == NULL)
        queue->rskq_accept_head = req;
    else
        queue->rskq_accept_tail->dl_next = req;

    queue->rskq_accept_tail = req;
    req->dl_next = NULL;
}
```
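The head/tail pair makes the accept queue a plain FIFO with constant-time append. Here is a minimal user-space sketch of the same list handling. struct req and struct acceptq are stand-ins, and acceptq_remove is a hypothetical counterpart for the accept() side, which this post does not show.

```c
#include <assert.h>
#include <stddef.h>

struct req { int id; struct req *dl_next; };     /* stand-in for request_sock */
struct acceptq { struct req *head, *tail; };

/* Append at the tail, as reqsk_queue_add does. */
static void acceptq_add(struct acceptq *q, struct req *r)
{
    assert(q != NULL && r != NULL);
    if (q->head == NULL)
        q->head = r;
    else
        q->tail->dl_next = r;
    q->tail = r;
    r->dl_next = NULL;
}

/* Hypothetical counterpart for accept(): take the oldest entry, or NULL. */
static struct req *acceptq_remove(struct acceptq *q)
{
    struct req *r = q->head;

    if (r != NULL) {
        q->head = r->dl_next;
        if (q->head == NULL)
            q->tail = NULL;
    }
    return r;
}
```

Connections therefore come out in the order in which their handshakes completed.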
Now the implementation of tcp_check_req:
```c
struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
                           struct request_sock *req,
                           struct request_sock **prev)
{
    const struct tcphdr *th = tcp_hdr(skb);
    __be32 flg = tcp_flag_word(th) & (TCP_FLAG_RST|TCP_FLAG_SYN|TCP_FLAG_ACK);
    int paws_reject = 0;
    struct tcp_options_received tmp_opt;
    struct sock *child;

    tmp_opt.saw_tstamp = 0;
    ......................................
    /// A pure retransmitted SYN: send the SYN-ACK again.
    if (TCP_SKB_CB(skb)->seq == tcp_rsk(req)->rcv_isn &&
        flg == TCP_FLAG_SYN &&
        !paws_reject) {
        req->rsk_ops->rtx_syn_ack(sk, req);
        return NULL;
    }

    ..........................................

    /// RST or SYN set (the retransmitted SYN was handled above): abandon the
    /// request. A RST is sent to the peer unless the segment itself was a RST.
    if (flg & (TCP_FLAG_RST|TCP_FLAG_SYN)) {
        TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_ATTEMPTFAILS);
        goto embryonic_reset;
    }

    /// Make sure the ACK flag is set.
    if (!(flg & TCP_FLAG_ACK))
        return NULL;

    /// Handle TCP_DEFER_ACCEPT: if it is set and this is a bare ACK without
    /// data, drop it. (With TCP_DEFER_ACCEPT the connection is only processed
    /// once data actually arrives, not on the final ACK alone.)
    if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
        TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
        inet_rsk(req)->acked = 1;
        return NULL;
    }

    /// Now a new socket can be created. syn_recv_sock returns the new child socket.
    child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL);
    if (child == NULL)
        goto listen_overflow;
    ..................................
#endif /* closes an #ifdef inside the elided part */
    /// On success, unlink req from listen_opt in request_sock_queue, i.e.
    /// remove it from the half-open queue.
    inet_csk_reqsk_queue_unlink(sk, req, prev);
    /// Update qlen and qlen_young.
    inet_csk_reqsk_queue_removed(sk, req);
    /// Finally add it to the accept queue. Note that the new socket ends up
    /// stored in req (req->sk).
    inet_csk_reqsk_queue_add(sk, req, child);
    return child;

listen_overflow:
    if (!sysctl_tcp_abort_on_overflow) {
        inet_rsk(req)->acked = 1;
        return NULL;
    }

embryonic_reset:
    NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_EMBRYONICRSTS);
    if (!(flg & TCP_FLAG_RST))
        req->rsk_ops->send_reset(sk, skb);

    inet_csk_reqsk_queue_drop(sk, req, prev);
    return NULL;
}
```
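The flag handling in tcp_check_req boils down to a short decision chain, sketched below in user space. The flag constants, enum req_action and check_req are hypothetical names; the PAWS and window checks from the elided parts are left out. The sketch tests the retransmitted-SYN case before the RST/SYN test, since in the opposite order that case could never be reached.

```c
#include <assert.h>
#include <stdbool.h>

#define F_SYN 0x1u
#define F_RST 0x2u
#define F_ACK 0x4u

enum req_action {
    REQ_RESEND_SYNACK,   /* pure retransmitted SYN */
    REQ_RESET,           /* embryonic_reset path */
    REQ_IGNORE,          /* no ACK flag: nothing to do */
    REQ_DEFER,           /* TCP_DEFER_ACCEPT and a bare ACK */
    REQ_CREATE_CHILD     /* call syn_recv_sock */
};

static enum req_action check_req(unsigned int flags, bool seq_is_rcv_isn,
                                 bool bare_ack, bool defer_accept)
{
    assert((flags & ~(F_SYN | F_RST | F_ACK)) == 0);

    if (seq_is_rcv_isn && flags == F_SYN)
        return REQ_RESEND_SYNACK;
    if (flags & (F_RST | F_SYN))
        return REQ_RESET;
    if (!(flags & F_ACK))
        return REQ_IGNORE;
    if (defer_accept && bare_ack)
        return REQ_DEFER;
    return REQ_CREATE_CHILD;
}
```

Only a segment that carries ACK, carries neither RST nor SYN, and is not held back by TCP_DEFER_ACCEPT reaches the point where the child socket is created.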
Finally, how the kernel creates the new socket. For TCP this is done by tcp_v4_syn_recv_sock. What it does is simple: it creates a new socket and sets its state to TCP_SYN_RECV (inside inet_csk_clone) while the parent socket stays in the listen state, then fills in some fields of the new socket and initializes a few timers. Timers are skipped entirely here; a later post will be dedicated to the TCP timers.
Back from tcp_v4_hnd_req, tcp_v4_do_rcv checks whether the returned socket differs from the parent socket and, if it does, calls tcp_child_process.
This function completes the three-way handshake: the child socket is set to TCP_ESTABLISHED and, when the conditions are met, the parent socket blocked in accept is woken up:
```c
int tcp_child_process(struct sock *parent, struct sock *child,
                      struct sk_buff *skb)
{
    int ret = 0;
    int state = child->sk_state;

    if (!sock_owned_by_user(child)) {
        /// Complete the three-way handshake.
        ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb),
                                    skb->len);
        /* Wakeup parent, send SIGIO */
        if (state == TCP_SYN_RECV && child->sk_state != state)
            /// Wake up the parent socket blocked in accept.
            parent->sk_data_ready(parent, 0);
    } else {
        /* Alas, it is possible again, because we do lookup
         * in main socket hash table and lock on listening
         * socket does not protect us more.
         */
        sk_add_backlog(child, skb);
    }

    bh_unlock_sock(child);
    sock_put(child);
    return ret;
}
```
Finally, the handling of the TCP_SYN_RECV state in tcp_rcv_state_process. It mainly prepares for the data transfer that is about to start by setting a number of related fields:
```c
    case TCP_SYN_RECV:
        if (acceptable) {
            tp->copied_seq = tp->rcv_nxt;
            smp_mb();
            /// Set the state to TCP_ESTABLISHED.
            tcp_set_state(sk, TCP_ESTABLISHED);
            sk->sk_state_change(sk);

            /// sk_wake_async is the asynchronous (SIGIO/fasync) notification path.
            if (sk->sk_socket)
                sk_wake_async(sk,
                              SOCK_WAKE_IO, POLL_OUT);

            /// Record the first unacknowledged sequence number (snd_una) and the
            /// peer's advertised window, scaled by snd_wscale.
            tp->snd_una = TCP_SKB_CB(skb)->ack_seq;
            tp->snd_wnd = ntohs(th->window) <<
                          tp->rx_opt.snd_wscale;
            tcp_init_wl(tp, TCP_SKB_CB(skb)->ack_seq,
                        TCP_SKB_CB(skb)->seq);

            .........................................................................
        break;
```
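The snd_wnd computation is a good place for a small worked example. The sketch below repeats it in user space; scaled_window is a hypothetical helper name, and the upper bound of 14 on the shift count is the limit set by the TCP window scale option (RFC 7323), not something taken from the snippet above.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>   /* htons/ntohs (POSIX) */

/* Send window in bytes, from the 16-bit wire field and the peer's shift count. */
static uint32_t scaled_window(uint16_t wire_window, unsigned int snd_wscale)
{
    assert(snd_wscale <= 14);   /* window scale shift is limited to 14 */
    /* Convert from network byte order first, then apply the shift. */
    return (uint32_t)ntohs(wire_window) << snd_wscale;
}
```

A header window of 1024 with a shift count of 7 yields a send window of 131072 bytes, while the same field with a shift count of 0 is just 1024.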