gpfdist error code = 104 (Connection reset by peer)

Post date: Nov 19, 2014 2:12:5 PM

Connection reset by peer errors can occur when there is high network packet loss. They can also occur when the ETL hosts are exhausting their TCP listen queue.
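One way to check whether the listen queue is actually overflowing on an ETL host is to look at the kernel's TcpExt counters. The sketch below is only an illustration: the function name listen_queue_counters is made up for this note, and it assumes a Linux host that exposes /proc/net/netstat with the ListenOverflows and ListenDrops fields.

```shell
#!/bin/sh
# Print the TcpExt ListenOverflows and ListenDrops counters.
# Reads /proc/net/netstat by default; another file can be passed as $1.
# (listen_queue_counters is a hypothetical helper name, not part of gpfdist.)
listen_queue_counters() {
    awk '
        # First TcpExt line holds the field names.
        $1 == "TcpExt:" && !seen { for (i = 2; i <= NF; i++) name[i] = $i; seen = 1; next }
        # Second TcpExt line holds the matching values.
        $1 == "TcpExt:" && seen {
            for (i = 2; i <= NF; i++)
                if (name[i] == "ListenOverflows" || name[i] == "ListenDrops")
                    print name[i] "=" $i
        }
    ' "${1:-/proc/net/netstat}"
}

# Only query the live counters where the file exists (Linux).
if [ -r /proc/net/netstat ]; then
    listen_queue_counters
fi
```

If these counters keep growing while loads are running, the listen queue is a likely suspect. If they stay flat, packet loss elsewhere on the network is more likely.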

See also Socket Tuning in RHEL

Setting somaxconn to 1024 is a low-risk operation and is typically recommended for web server applications, which gpfdist essentially is. When the kernel's TCP listen queue is exhausted, the kernel rejects new incoming TCP sessions. If the backlog argument passed to listen(2) is greater than the value in /proc/sys/net/core/somaxconn, it is silently truncated to that value.
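The truncation rule can be written out as a small calculation. This is a sketch only: effective_backlog is a hypothetical helper, not a kernel or gpfdist interface, and treating a negative backlog as "use the maximum" is assumed to match the Linux behavior.

```shell
#!/bin/sh
# effective_backlog REQUESTED [SOMAXCONN]
# Prints the backlog the kernel would actually use for listen(fd, REQUESTED).
# SOMAXCONN defaults to the live value from /proc (Linux only).
# (effective_backlog is a hypothetical helper name used for illustration.)
effective_backlog() {
    requested=$1
    limit=${2:-$(cat /proc/sys/net/core/somaxconn)}
    # Negative values and values above the limit are capped at somaxconn.
    if [ "$requested" -lt 0 ] || [ "$requested" -gt "$limit" ]; then
        echo "$limit"
    else
        echo "$requested"
    fi
}

effective_backlog 4096 1024   # prints 1024
effective_backlog 10 1024     # prints 10
effective_backlog -1 1024     # prints 1024
```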

Setting net.core.somaxconn higher than the default is only needed on very heavily loaded servers, where the connection rate is so high or bursty that having 128 queued connections (even more on BSDs: 128 backlog + 64 half-open) is not considered abnormal, or when you need to delegate the definition of what is normal to the people writing the application or its config.

Some administrators use a high net.core.somaxconn to hide problems with their services, so that from the user's point of view a process stall looks like a latency spike instead of an interrupted or timed-out connection (controlled by net.ipv4.tcp_abort_on_overflow in Linux).

The real cause is either slow processing of some requests (e.g. a single-threaded blocking server) or an insufficient number of worker threads/processes in the software (e.g. multi-process/threaded blocking software like Apache).

PS. As the listen(2) manual says, net.core.somaxconn acts only as an upper bound for an application, which is free to choose something smaller (usually set in the app's config), though some apps just use listen(fd, -1), which sets the backlog to the maximum.

PPS. Sometimes it's preferable to fail fast and let the load balancer do its job than to make the user wait. For that purpose we set net.core.somaxconn to a high value like 4096, but limit the application backlog to something small like 10 and set net.ipv4.tcp_abort_on_overflow to 1.

Workaround

GPDB GUC:

gp_external_max_segs = 64

In /etc/sysctl.conf, on ETL nodes only:

net.core.somaxconn = 1024

Changing the somaxconn without rebooting:

echo 1024 > /proc/sys/net/core/somaxconn
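
A cautious way to apply the change is to check the current value first and only then decide what to run. The following is a sketch under assumptions: somaxconn_plan is a made-up helper name, the target of 1024 is the value used in this note, and the script only prints a command rather than changing anything.

```shell
#!/bin/sh
# Report whether somaxconn is already at or above a target value, or print the
# command that would raise it. Nothing is modified by this script.
# (somaxconn_plan is a hypothetical helper name used for illustration.)
TARGET=1024   # value used in this note

somaxconn_plan() {
    current=$1
    target=$2
    if [ "$current" -ge "$target" ]; then
        echo "ok: somaxconn=$current"
    else
        # Equivalent to: echo $target > /proc/sys/net/core/somaxconn
        echo "sysctl -w net.core.somaxconn=$target"
    fi
}

# Only inspect the live value where the file exists (Linux).
if [ -r /proc/sys/net/core/somaxconn ]; then
    somaxconn_plan "$(cat /proc/sys/net/core/somaxconn)" "$TARGET"
fi
```

Run the printed command as root. After editing /etc/sysctl.conf, sysctl -p reloads the file without a reboot. The backlog of a listening socket is fixed when listen(2) is called, so a gpfdist process that is already running most likely needs to be restarted before it picks up the new limit.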