Docker runtime best practices

Network

  • Do not map privileged ports within containers
  • Open only the needed ports on the container
  • Do not mount the Docker socket inside any containers
  • Do not share the host's network namespace
  • Bind incoming container traffic to a specific host interface

Remediations

  • TCP/IP port numbers below 1024 are considered privileged ports: for security reasons, normal users and processes are not allowed to bind to them. Except for 80 (HTTP) and 443 (HTTPS), do not map a container port to a host port below 1024.
  • Expose only the ports the application needs. Do not use -P (--publish-all), which publishes every port exposed by the image to a random host port.
  • Do not mount the Docker socket (docker.sock) inside a container: a process that can reach the socket can control the Docker daemon, which effectively gives it root access to the host.
  • Never use --net=host (also spelled --network=host): it tells Docker not to containerize the container's networking, so the container shares the host's network stack and has full access to the host's network interfaces.
  • If the host has multiple network interfaces, a container accepts connections on its published ports on every interface by default. This may not be desired and may not be secure. Bind the container port to a specific host interface and host port (e.g. --publish 10.2.3.4:49153:80 maps container port 80 to port 49153 on the 10.2.3.4 interface).
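A `--publish` value can be checked against these rules before a container is started. The POSIX shell sketch below uses a hypothetical helper, `check_publish_spec` (not part of Docker). It assumes the IPv4 `hostIp:hostPort:containerPort` form and does not handle IPv6 addresses or port ranges.

```shell
# check_publish_spec: hypothetical helper (not part of Docker) that lints one
# --publish value. Assumes the IPv4 form hostIp:hostPort:containerPort[/proto];
# IPv6 host addresses and port ranges are not handled.
check_publish_spec() {
  spec=$1
  old_ifs=$IFS
  IFS=:
  set -- $spec        # split the value on ':' into $1, $2, $3
  IFS=$old_ifs
  # Without exactly three fields there is no host IP in this form
  # (e.g. 8080:80), so Docker binds to all host interfaces.
  if [ "$#" -ne 3 ] || [ "$1" = "0.0.0.0" ]; then
    echo "WARN $spec: not bound to a specific host interface"
    return 1
  fi
  host_port=$2
  # An empty host port (e.g. 10.2.3.4::80) lets Docker pick one; skip the check.
  if [ -n "$host_port" ] && [ "$host_port" -lt 1024 ] &&
     [ "$host_port" -ne 80 ] && [ "$host_port" -ne 443 ]; then
    echo "WARN $spec: privileged host port $host_port"
    return 1
  fi
  echo "OK $spec"
}
```

For example, `check_publish_spec 10.2.3.4:49153:80` prints `OK 10.2.3.4:49153:80`. `check_publish_spec 8080:80` prints a warning because no host interface is given.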
Audit
docker inspect --format 'Ports={{.NetworkSettings.Ports}}' CONTAINER_ID
// result: Ports=map[8080/tcp:[] 443/tcp:[{10.2.3.4 443}]]
// result should never contain '0.0.0.0' (a binding to all host interfaces)
docker inspect --format 'Volumes={{.Mounts}}' CONTAINER_ID | grep docker.sock
// should return nothing: any output means docker.sock is mounted in the container
docker inspect --format 'NetworkMode={{.HostConfig.NetworkMode}}' CONTAINER_ID
// result should never return 'NetworkMode=host'; expect something like NetworkMode=default, NetworkMode=bridge or the name of a user-defined network
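The three network audit commands above can be combined into a single check. This is a minimal sketch. `audit_network` is a hypothetical helper that only compares the strings those commands print, so it depends on the output format shown above. It detects the IPv4 wildcard 0.0.0.0 only.

```shell
# audit_network: hypothetical helper evaluating the outputs of the audit
# commands above (a sketch, not a complete audit):
#   $1  NetworkMode output, e.g. "NetworkMode=default"
#   $2  Ports output, e.g. "Ports=map[443/tcp:[{10.2.3.4 443}]]"
#   $3  docker.sock grep output ("" when the socket is not mounted)
# Only the IPv4 wildcard 0.0.0.0 is detected, not IPv6 wildcard bindings.
audit_network() {
  status=0
  if [ "$1" = "NetworkMode=host" ]; then
    echo "FAIL host network namespace is shared"
    status=1
  fi
  case $2 in
    *0.0.0.0*)
      echo "FAIL a port is bound to all host interfaces"
      status=1 ;;
  esac
  if [ -n "$3" ]; then
    echo "FAIL docker.sock is mounted"
    status=1
  fi
  if [ "$status" -eq 0 ]; then
    echo "PASS"
  fi
  return "$status"
}

# Example against a running container (requires Docker):
# id=CONTAINER_ID
# audit_network \
#   "$(docker inspect --format 'NetworkMode={{.HostConfig.NetworkMode}}' "$id")" \
#   "$(docker inspect --format 'Ports={{.NetworkSettings.Ports}}' "$id")" \
#   "$(docker inspect --format 'Volumes={{.Mounts}}' "$id" | grep docker.sock)"
```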

Volumes

  • Do not mount sensitive host system directories into containers
  • Mount the container's root filesystem as read-only
  • Do not set mount propagation mode to shared

Remediations

  • Do not mount sensitive host directories into containers, especially in read-write mode.
  • Avoid writes to the container's root filesystem. Data that the container must write should go to explicitly defined and administered volumes. Use the --read-only flag to mount the container's root filesystem as read-only, and use --volume to mount a separate writable volume where the application needs to write data.
  • Mount propagation controls whether mounts created under a mounted directory are replicated between the host and the container. It can be shared, slave or private (or the recursive variants rshared, rslave and rprivate), and container mounts are private (rprivate) by default. Do not use shared propagation unless it is needed: do not pass shared or rshared to --volume (e.g. --volume=/hostPath:/containerPath:shared).
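The same idea applies to `--volume` values. In this sketch, `check_volume_spec` is a hypothetical helper. It assumes the `hostPath:containerPath[:options]` syntax, flags the sensitive host directories listed in the audit below plus docker.sock, and rejects shared or rshared propagation. It does not handle named volumes, Windows paths or `--mount`.

```shell
# check_volume_spec: hypothetical helper that lints one --volume value of the
# form hostPath:containerPath[:options]. Sketch only: named volumes, Windows
# paths and the --mount syntax are not handled.
check_volume_spec() {
  spec=$1
  src=${spec%%:*}               # host path: text before the first ':'
  rest=${spec#*:}               # containerPath[:options]
  case $rest in
    *:*) opts=${rest#*:} ;;     # comma-separated options, e.g. ro,rslave
    *)   opts= ;;
  esac
  status=0
  case $src in
    /|/boot|/boot/*|/dev|/dev/*|/etc|/etc/*|/lib|/lib/*|\
    /proc|/proc/*|/sys|/sys/*|/usr|/usr/*|*/docker.sock)
      echo "WARN $spec: sensitive host path $src"
      status=1 ;;
  esac
  case ",$opts," in
    *,shared,*|*,rshared,*)
      echo "WARN $spec: shared mount propagation"
      status=1 ;;
  esac
  if [ "$status" -eq 0 ]; then
    echo "OK $spec"
  fi
  return "$status"
}
```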
Audit
docker inspect --format 'Volumes={{.Mounts}}' CONTAINER_ID
// result should not include any of these host directories as a mount source: /, /boot, /dev, /etc, /lib, /proc, /sys, /usr
docker inspect --format 'ReadonlyRootfs={{.HostConfig.ReadonlyRootfs}}' CONTAINER_ID
// result: ReadonlyRootfs=true
docker inspect --format 'Propagation={{range $mnt := .Mounts}} {{json $mnt.Propagation}} {{end}}' CONTAINER_ID
// result should not contain a propagation mode of "shared" or "rshared"
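Two of the volume audit results can be evaluated automatically. `audit_volumes` is a hypothetical helper that takes the ReadonlyRootfs and Propagation outputs printed by the commands above. It assumes that propagation values are JSON-quoted, as `{{json ...}}` prints them.

```shell
# audit_volumes: hypothetical helper evaluating two audit outputs:
#   $1  ReadonlyRootfs output, e.g. "ReadonlyRootfs=true"
#   $2  Propagation output, e.g. 'Propagation= "rprivate" '
# Assumes the JSON-quoted values produced by {{json $mnt.Propagation}}.
audit_volumes() {
  status=0
  if [ "$1" != "ReadonlyRootfs=true" ]; then
    echo "FAIL root filesystem is writable"
    status=1
  fi
  case $2 in
    *'"shared"'*|*'"rshared"'*)
      echo "FAIL a mount uses shared propagation"
      status=1 ;;
  esac
  if [ "$status" -eq 0 ]; then
    echo "PASS"
  fi
  return "$status"
}
```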

Resources

  • Limit memory usage for containers
  • Set container CPU priority appropriately
  • Override default ulimit at runtime only if needed

Remediations

  • By default, containers on a Docker host have no memory limit: a single container can use all of the host's memory. Set a memory limit to prevent a denial of service caused by one container consuming all of the host's resources. Use the -m or --memory argument with a memory size (e.g. --memory 256m).
  • By default, CPU time is divided equally between containers. CPU shares let you prioritize one container over another: every container gets 1024 CPU shares by default, and the value is a relative weight that only takes effect when containers compete for CPU. Use the --cpu-shares argument (e.g. with --cpu-shares 512, a container gets half the CPU time of a container with the default 1024 shares under contention).
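The arithmetic behind these two flags can be checked with a small sketch. `mem_to_bytes` (hypothetical) converts a `--memory` size into bytes, the unit `docker inspect` uses for `HostConfig.Memory`. It assumes binary multiples (k = 1024) and a lower-case suffix. `cpu_percent` (hypothetical) computes the approximate share of CPU time a container gets when all the listed containers are busy.

```shell
# mem_to_bytes: hypothetical helper converting a --memory value (e.g. 256m)
# into bytes, the unit of HostConfig.Memory in docker inspect.
# Assumes an integer size with an optional lower-case b, k, m or g suffix,
# interpreted as binary multiples (k = 1024).
mem_to_bytes() {
  size=$1
  case $size in
    *b) echo $(( ${size%b} )) ;;
    *k) echo $(( ${size%k} * 1024 )) ;;
    *m) echo $(( ${size%m} * 1024 * 1024 )) ;;
    *g) echo $(( ${size%g} * 1024 * 1024 * 1024 )) ;;
    *)  echo "$size" ;;   # no suffix: the value is already in bytes
  esac
}

# cpu_percent: hypothetical helper giving the approximate percentage of CPU
# time (rounded down) that a container with weight $1 receives while it and
# containers with weights $2, $3, ... are all busy.
cpu_percent() {
  mine=$1
  total=0
  for w in "$@"; do
    total=$(( total + w ))
  done
  echo $(( mine * 100 / total ))
}
```

For example, `mem_to_bytes 256m` prints 268435456. `cpu_percent 512 1024` prints 33 and `cpu_percent 1024 512` prints 66, so under contention the container with 512 shares receives about half the CPU time of the one with 1024.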
Audit
docker inspect --format 'Memory={{.HostConfig.Memory}}' CONTAINER_ID
// result should not return Memory=0, which means no memory limit is set (e.g. --memory 256m gives Memory=268435456)
docker inspect --format 'CpuShares={{.HostConfig.CpuShares}}' CONTAINER_ID 
// result should not return CpuShares=0 or CpuShares=1024: either value means CPU shares are not in place
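Both resource audit results can be evaluated in one step. `audit_resources` is a hypothetical helper, and this sketch only compares the strings printed by the commands above.

```shell
# audit_resources: hypothetical helper evaluating the two outputs above:
#   $1  Memory output, e.g. "Memory=268435456"
#   $2  CpuShares output, e.g. "CpuShares=512"
audit_resources() {
  status=0
  if [ "$1" = "Memory=0" ]; then
    echo "FAIL no memory limit"
    status=1
  fi
  case $2 in
    CpuShares=0|CpuShares=1024)
      echo "FAIL CPU shares not set"
      status=1 ;;
  esac
  if [ "$status" -eq 0 ]; then
    echo "PASS"
  fi
  return "$status"
}

# Example against a running container (requires Docker):
# audit_resources \
#   "$(docker inspect --format 'Memory={{.HostConfig.Memory}}' CONTAINER_ID)" \
#   "$(docker inspect --format 'CpuShares={{.HostConfig.CpuShares}}' CONTAINER_ID)"
```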

Capabilities/privileges

  • Restrict Linux Kernel Capabilities within containers
  • Do not use privileged containers
  • Confirm cgroup usage
  • Use PIDs cgroup limit if possible
  • Restrict container from acquiring additional privileges

Remediations

  • By default, Docker starts containers with a restricted set of Linux kernel capabilities, so a process can be granted only the capabilities it needs instead of full root access. If you need to change the default capabilities, drop them all with --cap-drop=all and add back only the ones you need with --cap-add (e.g. --cap-add=NET_BIND_SERVICE).

  • The --privileged flag gives all Linux kernel capabilities to the container and overrides the --cap-add and --cap-drop flags. Do not use the --privileged flag.

  • System administrators typically define the cgroups under which containers are supposed to run. Even if no cgroup is explicitly defined, containers run under the docker cgroup by default. Do not use the --cgroup-parent option of docker run unless needed.

  • A PIDs cgroup limit prevents fork bombs by restricting the number of processes that can exist inside a container at any given time. If the kernel version is 4.3 or later, start the container with the --pids-limit argument and an appropriate value (e.g. --pids-limit 100).

  • The no_new_privs bit ensures that a process and its children cannot gain additional privileges through setuid or setgid bits. Use the --security-opt=no-new-privileges argument.
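These remediations can be checked on a `docker run` argument list before a container is launched. `check_run_args` is a hypothetical lint, not a Docker feature. It only recognizes the flag spellings used in this section and does not validate the value given to --pids-limit.

```shell
# check_run_args: hypothetical lint over a docker run argument list.
# Sketch only: it looks for the flags named in this section, accepts both
# "--flag=value" and "--flag value", matches the capability value only as
# "all" or "ALL", and does not check the --pids-limit value.
check_run_args() {
  privileged=no drop_all=no nnp=no pids=no
  prev=
  for arg in "$@"; do
    # Normalise "--flag value" into "--flag=value" for flags that take a value.
    case $prev in
      --cap-drop|--security-opt) arg="$prev=$arg" ;;
    esac
    case $arg in
      --privileged|--privileged=true) privileged=yes ;;
      --cap-drop=all|--cap-drop=ALL) drop_all=yes ;;
      --security-opt=no-new-privileges|--security-opt=no-new-privileges:true) nnp=yes ;;
      --pids-limit|--pids-limit=*) pids=yes ;;
    esac
    prev=$arg
  done
  status=0
  if [ "$privileged" = yes ]; then
    echo "WARN --privileged is set"; status=1
  fi
  if [ "$drop_all" = no ]; then
    echo "WARN default capabilities are not dropped (--cap-drop=all)"; status=1
  fi
  if [ "$nnp" = no ]; then
    echo "WARN --security-opt=no-new-privileges is missing"; status=1
  fi
  if [ "$pids" = no ]; then
    echo "WARN no --pids-limit is set"; status=1
  fi
  if [ "$status" -eq 0 ]; then
    echo "OK"
  fi
  return "$status"
}
```

For example, `check_run_args --cap-drop=all --cap-add=NET_BIND_SERVICE --security-opt=no-new-privileges --pids-limit 100 nginx` prints `OK`.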
Audit
docker inspect --format='CapAdd={{.HostConfig.CapAdd}} CapDrop={{.HostConfig.CapDrop}}' CONTAINER_ID
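// result: CapAdd should list only the capabilities the container needs; CapAdd=[] CapDrop=[] means the default capability set is unchanged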
docker inspect --format='Privileged={{.HostConfig.Privileged}}' CONTAINER_ID
// result: Privileged=false
docker inspect --format 'CgroupParent={{.HostConfig.CgroupParent}}' CONTAINER_ID
// result should return 'CgroupParent=' or a well known defined cgroup
docker inspect --format 'PidsLimit={{.HostConfig.PidsLimit}}' CONTAINER_ID
// should not return: PidsLimit=0 or PidsLimit=-1
docker inspect --format 'SecurityOpt={{.HostConfig.SecurityOpt}}' CONTAINER_ID
// should contain at least no-new-privileges, e.g. SecurityOpt=[no-new-privileges]
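The privilege audit results can also be evaluated in a script. `audit_privileges` is a hypothetical helper that takes the Privileged, PidsLimit and SecurityOpt outputs. It treats any PidsLimit value that is not a positive integer as missing. That is an assumption meant to cover 0, -1 and unset values, whose exact printed form may vary between Docker versions.

```shell
# audit_privileges: hypothetical helper evaluating three audit outputs:
#   $1  e.g. "Privileged=false"
#   $2  e.g. "PidsLimit=100"
#   $3  e.g. "SecurityOpt=[no-new-privileges]"
audit_privileges() {
  status=0
  if [ "$1" != "Privileged=false" ]; then
    echo "FAIL container is privileged"; status=1
  fi
  # Assumption: anything other than a positive integer (0, -1, empty,
  # <nil>, ...) means no PIDs limit is in place.
  case ${2#PidsLimit=} in
    ''|0|*[!0-9]*) echo "FAIL no positive PIDs limit"; status=1 ;;
  esac
  case $3 in
    *no-new-privileges:false*) echo "FAIL no-new-privileges not set"; status=1 ;;
    *no-new-privileges*) ;;
    *) echo "FAIL no-new-privileges not set"; status=1 ;;
  esac
  if [ "$status" -eq 0 ]; then
    echo "PASS"
  fi
  return "$status"
}
```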

Processes

  • Do not share the host's process namespace
  • Do not share the host's IPC namespace
  • Do not share the host's UTS namespace
  • Do not disable default seccomp profile

Remediations

  • The PID namespace provides separation of processes: it hides the host's processes from the container and allows process IDs, including PID 1, to be reused. If the host's PID namespace is shared with the container, processes within the container can see all of the processes on the host system, which breaks process-level isolation between the host and the containers. Do not start a container with the --pid=host argument. If processes really need to share a PID namespace, share it with one specific container (--pid=container:<name|id>) rather than with the host.
  • The IPC namespace separates IPC between the host and the containers. If the host's IPC namespace is shared with the container, processes within the container can see all of the IPC objects (such as shared memory segments and message queues) on the host system, which breaks IPC-level isolation between the host and the containers. Do not start a container with the --ipc=host argument.
  • UTS namespaces isolate two system identifiers: the hostname and the NIS domain name, that is, the hostname and domain visible to processes running in that namespace. Processes running within containers typically do not need to know the host's hostname and domain name. Do not start a container with the --uts=host argument.
  • Seccomp filtering lets a process restrict the system calls it is allowed to make, and Docker applies a default seccomp profile that blocks a number of risky system calls. Do not disable it unless it prevents your containerized application from working: do not use --security-opt seccomp=unconfined (written --security-opt=seccomp:unconfined in older Docker versions).

Audit
docker inspect --format 'PidMode={{.HostConfig.PidMode}}' CONTAINER_ID
// result should never return PidMode=host
docker inspect --format 'IpcMode={{.HostConfig.IpcMode}}' CONTAINER_ID
// result should never return IpcMode=host
docker inspect --format 'UTSMode={{.HostConfig.UTSMode}}' CONTAINER_ID
// result should never return: 'UTSMode=host' but 'UTSMode='
docker inspect --format 'SecurityOpt={{.HostConfig.SecurityOpt}}' CONTAINER_ID
// result should never contain 'seccomp=unconfined' or 'seccomp:unconfined', e.g. 'SecurityOpt=[seccomp=unconfined]'
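The namespace and seccomp results can be checked together. `audit_namespaces` is a hypothetical helper that accepts any number of the outputs above. It fails on a host namespace or on an unconfined seccomp profile, and it only compares strings.

```shell
# audit_namespaces: hypothetical helper; pass it any of the PidMode, IpcMode,
# UTSMode and SecurityOpt outputs printed by the commands above.
audit_namespaces() {
  status=0
  for out in "$@"; do
    case $out in
      PidMode=host|IpcMode=host|UTSMode=host)
        echo "FAIL $out"
        status=1 ;;
      SecurityOpt=*seccomp=unconfined*|SecurityOpt=*seccomp:unconfined*)
        echo "FAIL seccomp is disabled"
        status=1 ;;
    esac
  done
  if [ "$status" -eq 0 ]; then
    echo "PASS"
  fi
  return "$status"
}

# Example against a running container (requires Docker):
# id=CONTAINER_ID
# audit_namespaces \
#   "$(docker inspect --format 'PidMode={{.HostConfig.PidMode}}' "$id")" \
#   "$(docker inspect --format 'IpcMode={{.HostConfig.IpcMode}}' "$id")" \
#   "$(docker inspect --format 'UTSMode={{.HostConfig.UTSMode}}' "$id")" \
#   "$(docker inspect --format 'SecurityOpt={{.HostConfig.SecurityOpt}}' "$id")"
```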

Related Links