
Container networking illustrated in 5 figures

2022-06-26 15:09:00 BOGO

Using containers always feels a bit like magic. For people who understand the underlying principles, containers are easy to work with; for those who don't, they can be a nightmare. Fortunately, we have been studying container technology for a while and have already discovered that containers are just isolated and restricted Linux processes, that running a container does not require an image, and that, conversely, building an image requires running some containers.

Now it's time to tackle the container networking problem, or, more precisely, the single-host container networking problem. This article will answer the following questions:

  • How do we virtualize network resources so that a container believes it has its own exclusive network?
  • How do we let containers coexist peacefully, without interfering with each other, yet still able to communicate?
  • How do we reach the outside world (e.g., the Internet) from inside a container?
  • How do we reach a container on a machine from the outside world (e.g., port publishing)?

The end result should be clear: single-host container networking is nothing but a simple combination of well-known Linux facilities:

  • Network namespaces
  • Virtual Ethernet devices (veth)
  • Virtual network switches (bridges)
  • IP routing and network address translation (NAT)

And no code is needed to make this networking magic happen...

Prerequisites

Any Linux distribution will do. All the examples in this article were executed on a vagrant CentOS 8 virtual machine:

$ vagrant init centos/8 
$ vagrant up 
$ vagrant ssh 

[vagrant@localhost ~]$ uname -a 
Linux localhost.localdomain 4.18.0-147.3.1.el8_1.x86_64

For simplicity's sake, this article does not rely on any full-fledged containerization solution (such as Docker or Podman). Instead, we focus on the basic concepts and use the simplest possible tools to achieve our learning goals.

Isolating containers with network namespaces

What makes up the Linux network stack? Obviously, a set of network devices. What else? Probably also a set of routing rules. And don't forget the netfilter hooks, including those defined by iptables rules.

We can quickly write a simple script, inspect-net-stack.sh, to show all of this:

#!/usr/bin/env bash 
echo "> Network devices" 
ip link 

echo -e "\n> Route table" 
ip route 

echo -e "\n> Iptables rules" 
iptables --list-rules

Before running the script, let's add a custom iptables chain, so that we can later recognize this particular network stack by it:

$ sudo iptables -N ROOT_NS

After that, running the script on the host machine produces the following output:

$ sudo ./inspect-net-stack.sh 
    > Network devices 
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000 link/ether 52:54:00:e3:27:77 brd ff:ff:ff:ff:ff:ff 
    > Route table 
    default via 10.0.2.2 dev eth0 proto dhcp metric 100 
    10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 100 
    > Iptables rules 
    -P INPUT ACCEPT 
    -P FORWARD ACCEPT 
    -P OUTPUT ACCEPT 
    -N ROOT_NS

We are interested in this output because we need to make sure that each container we are about to create gets its own, completely separate network stack.

As you may already know, one of the Linux namespaces used for container isolation is the network namespace. From man ip-netns: "A network namespace is logically another copy of the network stack, with its own routes, firewall rules, and network devices." For simplicity, this is the only namespace used in this article; rather than creating fully isolated containers, we limit the scope to the network stack alone.

One way to create a network namespace is the ip tool, which is part of iproute2:

$ sudo ip netns add netns0 
$ ip netns 
netns0
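As a side note, a namespace created this way is kept alive by a bind-mounted file under /var/run/netns; this is exactly the path we will pass to nsenter below. A quick check (a sketch, assuming the netns0 created above):

$ ls -1 /var/run/netns/ 
netns0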

How do we use the namespace we just created? With a handy command called nsenter, which enters one or more specified namespaces and then executes the given program:

$ sudo nsenter --net=/var/run/netns/netns0 bash
# the new bash process is now in netns0
$ sudo ./inspect-net-stack.sh 
> Network devices 
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 
> Route table 
> Iptables rules 
-P INPUT ACCEPT 
-P FORWARD ACCEPT 
-P OUTPUT ACCEPT

From the output above it is clear that the bash process running in the netns0 namespace sees a completely different network stack: there are no routing rules, no custom iptables chains, and only a single loopback device.
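By the way, nsenter is not the only option: the ip tool itself can run a single command inside a named namespace with ip netns exec, which is convenient for quick checks (a sketch; the output matches the freshly created namespace above):

$ sudo ip netns exec netns0 ip link 
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00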

Connecting the host and the container with virtual Ethernet devices (veth)

A dedicated network stack is not very useful if we cannot communicate with it. Fortunately, Linux provides a handy tool for this: virtual Ethernet devices. From man veth: "veth devices are virtual Ethernet devices. They can act as tunnels between network namespaces to create a bridge to a physical network device in another namespace, but can also be used as standalone network devices."

Virtual Ethernet devices always come in pairs. Let's create a pair:

$ sudo ip link add veth0 type veth peer name ceth0

With this single command we created a pair of interconnected virtual Ethernet devices and named them veth0 and ceth0.

$ ip link 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 
 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
  link/ether 52:54:00:e3:27:77 brd ff:ff:ff:ff:ff:ff 
5: ceth0@veth0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 
  link/ether 66:2d:24:e3:49:3f brd ff:ff:ff:ff:ff:ff 
6: veth0@ceth0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 
  link/ether 96:e8:de:1d:22:e0 brd ff:ff:ff:ff:ff:ff

Both the newly created veth0 and ceth0 live in the host's network stack (also known as the root network namespace). To connect the netns0 namespace to the root namespace, we need to keep one of the devices in the root namespace and move the other into netns0:

$ sudo ip link set ceth0 netns netns0 
# list all devices: ceth0 has disappeared from the root stack
$ ip link 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000 
    link/ether 52:54:00:e3:27:77 brd ff:ff:ff:ff:ff:ff 
6: veth0@if5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 
    link/ether 96:e8:de:1d:22:e0 brd ff:ff:ff:ff:ff:ff link-netns netns0

Once the devices are brought up and assigned appropriate IP addresses, any packet generated on one of them immediately appears on its peer, connecting the two namespaces. Start from the root namespace:

$ sudo ip link set veth0 up 
$ sudo ip addr add 172.18.0.11/16 dev veth0

And then in netns0:

$ sudo nsenter --net=/var/run/netns/netns0 
$ ip link set lo up 
$ ip link set ceth0 up 
$ ip addr add 172.18.0.10/16 dev ceth0 
$ ip link 
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 
 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 
5: ceth0@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 
    link/ether 66:2d:24:e3:49:3f brd ff:ff:ff:ff:ff:ff link-netnsid 0

Check connectivity:

# in netns0, ping the root namespace's veth0
 $ ping -c 2 172.18.0.11 
 PING 172.18.0.11 (172.18.0.11) 56(84) bytes of data. 
 64 bytes from 172.18.0.11: icmp_seq=1 ttl=64 time=0.038 ms 
 64 bytes from 172.18.0.11: icmp_seq=2 ttl=64 time=0.040 ms 
 --- 172.18.0.11 ping statistics --- 
 2 packets transmitted, 2 received, 0% packet loss, time 58ms 
 rtt min/avg/max/mdev = 0.038/0.039/0.040/0.001 ms 
# leave netns0
$ exit
# in the root namespace, ping ceth0
 $ ping -c 2 172.18.0.10 
 PING 172.18.0.10 (172.18.0.10) 56(84) bytes of data. 
 64 bytes from 172.18.0.10: icmp_seq=1 ttl=64 time=0.073 ms 
 64 bytes from 172.18.0.10: icmp_seq=2 ttl=64 time=0.046 ms 
 --- 172.18.0.10 ping statistics --- 
 2 packets transmitted, 2 received, 0% packet loss, time 3ms 
 rtt min/avg/max/mdev = 0.046/0.059/0.073/0.015 ms

At the same time, attempts to reach any other address from the netns0 namespace fail:

# in the root namespace
$ ip addr show dev eth0 
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000 
    link/ether 52:54:00:e3:27:77 brd ff:ff:ff:ff:ff:ff 
    inet 10.0.2.15/24 brd 10.0.2.255 scope global dynamic noprefixroute eth0 
       valid_lft 84057sec preferred_lft 84057sec
    inet6 fe80::5054:ff:fee3:2777/64 scope link 
       valid_lft forever preferred_lft forever 
# note the host's IP here: 10.0.2.15
$ sudo nsenter --net=/var/run/netns/netns0 
# try pinging the host's eth0
$ ping 10.0.2.15 
connect: Network is unreachable 
# try reaching the Internet
$ ping 8.8.8.8 
connect: Network is unreachable

This is easy to understand: there is simply no route for such packets in netns0's routing table. The only entry there shows how to reach the 172.18.0.0/16 network:

# in the netns0 namespace:
$ ip route 
172.18.0.0/16 dev ceth0 proto kernel scope link src 172.18.0.10

There are several ways to populate a routing table in Linux. One of them is to derive routes directly from network interfaces. Remember that right after the namespace was created, the routing table in netns0 was empty. But then we added the ceth0 device and assigned it the address 172.18.0.10/16. Because we used not a plain IP address but an address-plus-subnet-mask combination, the network stack could derive routing information from it: every packet destined for the 172.18.0.0/16 network will be sent through the ceth0 device, while all other packets will be dropped. Similarly, the root namespace also got a new route:

# in the root namespace:
$ ip route 
# ... irrelevant lines omitted ...
172.18.0.0/16 dev veth0 proto kernel scope link src 172.18.0.11
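If you want to see which of these routes the kernel would actually pick for a given destination, ip route get prints the selected path (a sketch, run in the root namespace; the exact output depends on your setup):

$ ip route get 172.18.0.10 
172.18.0.10 dev veth0 src 172.18.0.11 
    cache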

At this point we can answer the first question: we have learned how to isolate, virtualize, and interconnect Linux network stacks.

Connecting containers with a virtual network switch (bridge)

The driving force behind containerization is efficient resource sharing, so it is uncommon to run only one container per machine. Instead, the goal is to run as many isolated processes as possible in a shared environment. So what happens if we put multiple containers on the same host following the veth scheme above? Let's try adding a second container.

# from the root namespace
    $ sudo ip netns add netns1 
    $ sudo ip link add veth1 type veth peer name ceth1 
    $ sudo ip link set ceth1 netns netns1 
    $ sudo ip link set veth1 up 
    $ sudo ip addr add 172.18.0.21/16 dev veth1 
    $ sudo nsenter --net=/var/run/netns/netns1 
    $ ip link set lo up 
    $ ip link set ceth1 up 
    $ ip addr add 172.18.0.20/16 dev ceth1

Check connectivity:

# from netns1, the root namespace's veth1 is unreachable!
$ ping -c 2 172.18.0.21 
PING 172.18.0.21 (172.18.0.21) 56(84) bytes of data. 
From 172.18.0.20 icmp_seq=1 Destination Host Unreachable 
From 172.18.0.20 icmp_seq=2 Destination Host Unreachable 
--- 172.18.0.21 ping statistics --- 
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 55ms pipe 2 
# but the route exists!
$ ip route 
172.18.0.0/16 dev ceth1 proto kernel scope link src 172.18.0.20 
# leave netns1
$ exit
# netns1 cannot be reached from the root namespace either
$ ping -c 2 172.18.0.20 
PING 172.18.0.20 (172.18.0.20) 56(84) bytes of data. 
From 172.18.0.11 icmp_seq=1 Destination Host Unreachable 
From 172.18.0.11 icmp_seq=2 Destination Host Unreachable 
--- 172.18.0.20 ping statistics --- 
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 23ms pipe 2 
# from netns0, veth1 is reachable
$ sudo nsenter --net=/var/run/netns/netns0 
$ ping -c 2 172.18.0.21 
PING 172.18.0.21 (172.18.0.21) 56(84) bytes of data. 
64 bytes from 172.18.0.21: icmp_seq=1 ttl=64 time=0.037 ms 
64 bytes from 172.18.0.21: icmp_seq=2 ttl=64 time=0.046 ms 
--- 172.18.0.21 ping statistics --- 
2 packets transmitted, 2 received, 0% packet loss, time 33ms 
rtt min/avg/max/mdev = 0.037/0.041/0.046/0.007 ms 
# but netns1 is still unreachable
$ ping -c 2 172.18.0.20 
PING 172.18.0.20 (172.18.0.20) 56(84) bytes of data. 
From 172.18.0.10 icmp_seq=1 Destination Host Unreachable 
From 172.18.0.10 icmp_seq=2 Destination Host Unreachable 
--- 172.18.0.20 ping statistics --- 
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 63ms pipe 2

Ouch! Something is wrong... netns1 is in trouble: it cannot reach the root namespace, and it cannot be reached from the root namespace either. However, because both containers sit in the same IP subnet, 172.18.0.0/16, the netns0 container can still reach the host's veth1.

It took some time to figure out the cause, but it is clearly a routing problem. First, check the root namespace's routing table:

$ ip route 
# ... irrelevant lines omitted ...
172.18.0.0/16 dev veth0 proto kernel scope link src 172.18.0.11 
172.18.0.0/16 dev veth1 proto kernel scope link src 172.18.0.21

After the second veth pair was added, the root network stack learned the new route 172.18.0.0/16 dev veth1 proto kernel scope link src 172.18.0.21, but a route for that network already existed. When the second container tries to ping veth1, the first routing rule is selected, and connectivity breaks. If we deleted the first route (sudo ip route delete 172.18.0.0/16 dev veth0 proto kernel scope link src 172.18.0.11) and re-checked connectivity, the situation would simply flip: netns1 could connect, but netns0 could not.

If we chose a different subnet for netns1, everything would work. However, multiple containers sitting in the same IP subnet is a perfectly legitimate use case, so we need to adjust the veth scheme.

Don't forget that there is still the Linux bridge, yet another virtualized networking technology! A Linux bridge behaves like a network switch: it forwards packets between the interfaces connected to it. And because it is a switch, it does this forwarding at the L2 (Ethernet) layer.

Let's try this tool. But first, we need to clear the existing setup, because some of the previous configuration is no longer needed. Delete the network namespaces and the veth devices:

$ sudo ip netns delete netns0 
$ sudo ip netns delete netns1 
$ sudo ip link delete veth0 
$ sudo ip link delete ceth0 
$ sudo ip link delete veth1 
$ sudo ip link delete ceth1

Then quickly rebuild the two containers. Note that we do not assign any IP address to the new veth0 and veth1 devices:

$ sudo ip netns add netns0 
$ sudo ip link add veth0 type veth peer name ceth0 
$ sudo ip link set veth0 up 
$ sudo ip link set ceth0 netns netns0 

$ sudo nsenter --net=/var/run/netns/netns0 
$ ip link set lo up 
$ ip link set ceth0 up 
$ ip addr add 172.18.0.10/16 dev ceth0 
$ exit 

$ sudo ip netns add netns1 
$ sudo ip link add veth1 type veth peer name ceth1 
$ sudo ip link set veth1 up 
$ sudo ip link set ceth1 netns netns1 

$ sudo nsenter --net=/var/run/netns/netns1 
$ ip link set lo up 
$ ip link set ceth1 up 
$ ip addr add 172.18.0.20/16 dev ceth1 
$ exit

Make sure there are no new routes on the host:

$ ip route 
default via 10.0.2.2 dev eth0 proto dhcp metric 100 
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 100

Finally, create the bridge interface:

$ sudo ip link add br0 type bridge 
$ sudo ip link set br0 up

Attach veth0 and veth1 to the bridge:

$ sudo ip link set veth0 master br0 
$ sudo ip link set veth1 master br0
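The bridge now has two ports. You can verify the attachment with either of these commands (a quick sanity check; both are part of iproute2):

$ ip link show master br0    # devices enslaved to br0 
$ bridge link                # bridge-port view of the same information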

Check the connectivity between the containers:

$ sudo nsenter --net=/var/run/netns/netns0 
$ ping -c 2 172.18.0.20 
PING 172.18.0.20 (172.18.0.20) 56(84) bytes of data. 
64 bytes from 172.18.0.20: icmp_seq=1 ttl=64 time=0.259 ms 
64 bytes from 172.18.0.20: icmp_seq=2 ttl=64 time=0.051 ms 
--- 172.18.0.20 ping statistics --- 
2 packets transmitted, 2 received, 0% packet loss, time 2ms 
rtt min/avg/max/mdev = 0.051/0.155/0.259/0.104 ms
$ sudo nsenter --net=/var/run/netns/netns1 
$ ping -c 2 172.18.0.10 
PING 172.18.0.10 (172.18.0.10) 56(84) bytes of data. 
64 bytes from 172.18.0.10: icmp_seq=1 ttl=64 time=0.037 ms 
64 bytes from 172.18.0.10: icmp_seq=2 ttl=64 time=0.089 ms 
--- 172.18.0.10 ping statistics --- 
2 packets transmitted, 2 received, 0% packet loss, time 36ms 
rtt min/avg/max/mdev = 0.037/0.063/0.089/0.026 ms

Great! Everything works. With this new scheme we don't need to configure veth0 and veth1 at all; only the ceth0 and ceth1 endpoints need IP addresses. But since they are both attached to the same Ethernet segment (remember, they are connected to the virtual switch), they are connected at the L2 layer:

$ sudo nsenter --net=/var/run/netns/netns0 
$ ip neigh 
172.18.0.20 dev ceth0 lladdr 6e:9c:ae:02:60:de STALE 
$ exit 

$ sudo nsenter --net=/var/run/netns/netns1 
$ ip neigh 
172.18.0.10 dev ceth1 lladdr 66:f3:8c:75:09:29 STALE 
$ exit
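The bridge itself also learns which MAC address sits behind which port. You can peek at its forwarding database with the bridge tool (a sketch; the MAC addresses and the set of entries will differ on your machine):

$ bridge fdb show br br0 
66:f3:8c:75:09:29 dev veth0 master br0 
6e:9c:ae:02:60:de dev veth1 master br0 
# ... permanent/local entries omitted ...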

Great, we have learned how to turn containers into good neighbors that do not interfere with each other but can still talk to each other.

Reaching the outside world (IP routing and masquerading)

The containers can now talk to each other. But can they talk to the host, i.e., the root namespace?

$ sudo nsenter --net=/var/run/netns/netns0 
$ ping 10.0.2.15 # eth0 address 
connect: Network is unreachable

The cause is obvious: there is no such route in netns0's routing table:

$ ip route 
172.18.0.0/16 dev ceth0 proto kernel scope link src 172.18.0.10

The root namespace cannot reach the containers either:

# first leave netns0 with exit; then, from the root namespace:
$ ping -c 2 172.18.0.10 
PING 172.18.0.10 (172.18.0.10) 56(84) bytes of data. 
From 213.51.1.123 icmp_seq=1 Destination Net Unreachable 
From 213.51.1.123 icmp_seq=2 Destination Net Unreachable 
--- 172.18.0.10 ping statistics --- 
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 3ms 

$ ping -c 2 172.18.0.20 
PING 172.18.0.20 (172.18.0.20) 56(84) bytes of data. 
From 213.51.1.123 icmp_seq=1 Destination Net Unreachable 
From 213.51.1.123 icmp_seq=2 Destination Net Unreachable 
--- 172.18.0.20 ping statistics --- 
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 3ms

To establish connectivity between the root namespace and the containers, we need to assign an IP address to the bridge interface:

$ sudo ip addr add 172.18.0.1/16 dev br0

Once the bridge interface has an IP address, the host's routing table gets one more route:

$ ip route 
# ... irrelevant lines omitted ...
172.18.0.0/16 dev br0 proto kernel scope link src 172.18.0.1 

$ ping -c 2 172.18.0.10 
PING 172.18.0.10 (172.18.0.10) 56(84) bytes of data. 
64 bytes from 172.18.0.10: icmp_seq=1 ttl=64 time=0.036 ms 
64 bytes from 172.18.0.10: icmp_seq=2 ttl=64 time=0.049 ms 

--- 172.18.0.10 ping statistics --- 
2 packets transmitted, 2 received, 0% packet loss, time 11ms 
rtt min/avg/max/mdev = 0.036/0.042/0.049/0.009 ms 

$ ping -c 2 172.18.0.20 
PING 172.18.0.20 (172.18.0.20) 56(84) bytes of data. 
64 bytes from 172.18.0.20: icmp_seq=1 ttl=64 time=0.059 ms 
64 bytes from 172.18.0.20: icmp_seq=2 ttl=64 time=0.056 ms 

--- 172.18.0.20 ping statistics --- 
2 packets transmitted, 2 received, 0% packet loss, time 4ms 
rtt min/avg/max/mdev = 0.056/0.057/0.059/0.007 ms

The containers can probably also ping the bridge interface now, but they still cannot reach the host's eth0. We need to add a default route for the containers:

$ sudo nsenter --net=/var/run/netns/netns0 
$ ip route add default via 172.18.0.1 
$ ping -c 2 10.0.2.15 
PING 10.0.2.15 (10.0.2.15) 56(84) bytes of data. 
64 bytes from 10.0.2.15: icmp_seq=1 ttl=64 time=0.036 ms 
64 bytes from 10.0.2.15: icmp_seq=2 ttl=64 time=0.053 ms 
--- 10.0.2.15 ping statistics --- 
2 packets transmitted, 2 received, 0% packet loss, time 14ms 
rtt min/avg/max/mdev = 0.036/0.044/0.053/0.010 ms 
    # repeat the above configuration for `netns1` (see the sketch below)
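For completeness, here is the equivalent configuration for netns1 (a sketch that simply repeats the pattern above):

$ sudo nsenter --net=/var/run/netns/netns1 
$ ip route add default via 172.18.0.1 
$ ping -c 2 10.0.2.15 
$ exit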

This change essentially turns the host into a router, with the bridge interface acting as the default gateway for the containers.

Very good, we have connected the containers to the root namespace. Now let's try to connect them to the outside world. Linux has packet forwarding (i.e., routing) disabled by default, so we need to enable it first:

# in the root namespace
sudo bash -c 'echo 1 > /proc/sys/net/ipv4/ip_forward'
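The same switch can be flipped with sysctl, which is the more common way to do it; note that neither form survives a reboot unless you also persist the setting (for example in /etc/sysctl.d/):

$ sudo sysctl -w net.ipv4.ip_forward=1 
net.ipv4.ip_forward = 1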

Check connectivity again:

$ sudo nsenter --net=/var/run/netns/netns0 
$ ping 8.8.8.8 
# hangs ...

Still no luck. What did we miss? Even if the container could send packets out, the target server would not be able to send packets back, because the container's IP address is private: routing rules for that particular IP are known only to the local network. Besides, many containers share exactly the same private IP address, 172.18.0.10. The solution to this problem is called Network Address Translation (NAT): before reaching the external network, packets sent by the container get their source IP address replaced with the host's external address. The host also keeps track of all existing mappings and, before forwarding a reply back to a container, restores the original IP address. Sounds complicated, but there is good news: the iptables module lets us do all of this with a single command:

$ sudo iptables -t nat -A POSTROUTING -s 172.18.0.0/16 ! -o br0 -j MASQUERADE

The command is fairly simple: it adds a new rule to the POSTROUTING chain of the nat table, asking it to masquerade all packets that originate from the 172.18.0.0/16 network, except those leaving through the bridge interface.
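You can confirm that the rule is in place by listing the POSTROUTING chain of the nat table (assuming no other NAT rules have been added on this host):

$ sudo iptables -t nat -S POSTROUTING 
-P POSTROUTING ACCEPT 
-A POSTROUTING -s 172.18.0.0/16 ! -o br0 -j MASQUERADE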

Check connectivity:

$ sudo nsenter --net=/var/run/netns/netns0 
$ ping -c 2 8.8.8.8 
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data. 
64 bytes from 8.8.8.8: icmp_seq=1 ttl=61 time=43.2 ms 
64 bytes from 8.8.8.8: icmp_seq=2 ttl=61 time=36.8 ms 
--- 8.8.8.8 ping statistics --- 
2 packets transmitted, 2 received, 0% packet loss, time 2ms 
rtt min/avg/max/mdev = 36.815/40.008/43.202/3.199 ms

Note that we are relying on the default allow-everything policy here, which would be quite dangerous in a real environment. The host's default iptables policy is ACCEPT on every chain:

$ sudo iptables -S 
-P INPUT ACCEPT 
-P FORWARD ACCEPT 
-P OUTPUT ACCEPT

Docker, by contrast, restricts all traffic by default and only enables routing for known paths.

For example, here are the rules generated by the Docker daemon on a CentOS 8 machine when a single container publishes port 5005:

$ sudo iptables -t filter --list-rules 
-P INPUT ACCEPT 
-P FORWARD DROP 
-P OUTPUT ACCEPT 
-N DOCKER 
-N DOCKER-ISOLATION-STAGE-1 
-N DOCKER-ISOLATION-STAGE-2 
-N DOCKER-USER 
-A FORWARD -j DOCKER-USER 
-A FORWARD -j DOCKER-ISOLATION-STAGE-1 
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER 
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT 
-A FORWARD -i docker0 -o docker0 -j ACCEPT 
-A DOCKER -d 172.17.0.2/32 ! -i docker0 -o docker0 -p tcp -m tcp --dport 5000 -j ACCEPT 
-A DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2 
-A DOCKER-ISOLATION-STAGE-1 -j RETURN 
-A DOCKER-ISOLATION-STAGE-2 -o docker0 -j DROP 
-A DOCKER-ISOLATION-STAGE-2 -j RETURN 
-A DOCKER-USER -j RETURN 

$ sudo iptables -t nat --list-rules 
-P PREROUTING ACCEPT 
-P INPUT ACCEPT 
-P POSTROUTING ACCEPT 
-P OUTPUT ACCEPT 
-N DOCKER 
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER 
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE 
-A POSTROUTING -s 172.17.0.2/32 -d 172.17.0.2/32 -p tcp -m tcp --dport 5000 -j MASQUERADE
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER 
-A DOCKER -i docker0 -j RETURN 
-A DOCKER ! -i docker0 -p tcp -m tcp --dport 5005 -j DNAT --to-destination 172.17.0.2:5000 

$ sudo iptables -t mangle --list-rules 
-P PREROUTING ACCEPT 
-P INPUT ACCEPT 
-P FORWARD ACCEPT 
-P OUTPUT ACCEPT
 -P POSTROUTING ACCEPT 

$ sudo iptables -t raw --list-rules 
-P PREROUTING ACCEPT 
-P OUTPUT ACCEPT

Making containers reachable from the outside world (port publishing)

We all know that container ports can be published to some (or all) of the host's interfaces. But what does port publishing actually mean?

Suppose a server is running inside the container:

$ sudo nsenter --net=/var/run/netns/netns0 
$ python3 -m http.server --bind 172.18.0.10 5000

If we send an HTTP request to this server from the host, everything works (there is a connection between the root namespace and all container interfaces, so of course it succeeds):

# from the root namespace
$ curl 172.18.0.10:5000 
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"  "http://www.w3.org/TR/html4/strict.dtd"> 
# ... irrelevant lines omitted ...

However, if we wanted to access this server from the outside, which IP address would we use? The only IP we know is the host's external interface address on eth0:

$ curl 10.0.2.15:5000 
curl: (7) Failed to connect to 10.0.2.15 port 5000: Connection refused

Therefore, we need a way to forward any packet arriving on port 5000 of the host's eth0 to the destination 172.18.0.10:5000. Once again, iptables to the rescue!

# external traffic
sudo iptables -t nat -A PREROUTING -d 10.0.2.15 -p tcp -m tcp --dport 5000 -j DNAT --to-destination 172.18.0.10:5000 
# local traffic (it does not pass through the PREROUTING chain)
sudo iptables -t nat -A OUTPUT -d 10.0.2.15 -p tcp -m tcp --dport 5000 -j DNAT --to-destination 172.18.0.10:5000

In addition, we need to enable iptables to intercept traffic on bridged networks:

sudo modprobe br_netfilter
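Once the module is loaded, the corresponding sysctl becomes available and defaults to 1, meaning bridged IPv4 traffic is now passed through iptables (a quick check):

$ sysctl net.bridge.bridge-nf-call-iptables 
net.bridge.bridge-nf-call-iptables = 1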

Test it:

$ curl 10.0.2.15:5000 
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"  "http://www.w3.org/TR/html4/strict.dtd">
# ... irrelevant lines omitted ...

Understanding Docker network drivers

How can we put this knowledge to use? For example, we can try to make sense of Docker's network modes[1].

Start with --network host mode. Try comparing the output of ip link and sudo docker run -it --rm --network host alpine ip link. They are almost identical! In host mode, Docker simply does not use network namespace isolation: the container works in the root network namespace and shares the network stack with the host.
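A quick way to run the comparison side by side (assuming Docker is installed and the alpine image can be pulled):

$ ip link                                                  # the host's view 
$ sudo docker run -it --rm --network host alpine ip link   # the container's view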

The next mode is --network none. The output of sudo docker run -it --rm --network none alpine ip link contains only a single loopback interface. This is very similar to the network namespace we created earlier, before any veth device was added.

Finally, there is the --network bridge mode (the default). This is exactly the scheme we built above. Try the ip and iptables commands and look at the network stack from the host's and the container's points of view.
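For instance, you can compare Docker's default bridge network with what we built by hand (assuming a running Docker daemon; docker0 is Docker's counterpart of our br0):

$ sudo docker network inspect bridge      # subnet and gateway of the default network 
$ ip addr show docker0                    # the bridge device on the host 
$ sudo iptables -t nat -S POSTROUTING     # Docker's MASQUERADE rule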

Rootless containers and networking

A nice feature of the Podman container manager is its focus on rootless containers. However, as you have probably noticed, this article relies heavily on sudo: the network cannot be configured without root privileges. Podman's approach to networking for root containers[2] is very similar to Docker's. For rootless containers, however, Podman relies on the slirp4netns[3] project:

Starting from Linux 3.8, unprivileged users can create network_namespaces(7) together with user_namespaces(7). However, an unprivileged network namespace is not very useful on its own, because creating a veth(4) pair between the host and the network namespace still requires root privileges.

slirp4netns can connect a network namespace to the Internet in a completely unprivileged way, via a TAP device inside the network namespace connected to a user-mode TCP/IP stack (slirp).

Rootless networking is quite limited: "Technically, the container itself does not have an IP address, because without root privileges network devices cannot be associated with it. Moreover, pinging from a rootless container does not work, because it lacks the CAP_NET_RAW capability that the ping command requires." But it is still better than no connectivity at all.
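As a side note, the missing ping is usually worked around on the host rather than inside the container: allowing unprivileged ICMP echo sockets via a host-level sysctl lets ping work without CAP_NET_RAW (a hedged sketch; the exact group range is up to you):

$ sudo sysctl -w "net.ipv4.ping_group_range=0 2147483647"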

Conclusion

The way of organizing container networking described in this article is just one possible scheme (probably the most widely used one). There are many other approaches, implemented by official or third-party plugins, but all of them rely heavily on Linux network virtualization technologies[4]. Containerization can therefore be seen as, first and foremost, a virtualization technique.

Related links:

  1. https://docs.docker.com/network/#network-drivers
  2. https://www.redhat.com/sysadmin/container-networking-podman
  3. https://github.com/rootless-containers/slirp4netns
  4. https://developers.redhat.com/blog/2018/10/22/introduction-to-linux-interfaces-for-virtual-networking/

Original article: https://iximiuz.com/en/posts/container-networking-is-simple/


Copyright notice
This article was created by [BOGO]. Please include a link to the original when reposting.
https://yzsam.com/2022/177/202206261451205053.html