Next.js 16 consuming 1+ CPU core per pod at idle on k3s - constant crash loops
Unanswered
Pacific herring posted this in #help-forum
Pacific herringOP
I'm running Next.js 16.0.10 in production on a k3s cluster and experiencing severe performance issues that I didn't have before migrating to Kubernetes.
The problem:
* Each pod consumes ~1100m CPU (1+ core) constantly, even with zero traffic
* This causes readiness/liveness probes to time out → pod restarts
* 124+ restarts in 22 hours, creating an endless crash loop
* The app starts fine (Ready in 153ms) but immediately spins CPU to 100%
Current metrics (with 0 traffic):
NAME          CPU(cores)   MEMORY(bytes)
web-app-xxx   1098m        339Mi
web-app-yyy   1177m        280Mi
Inside the pod (top):
PID 1 next-server 29% CPU VSZ 11.1g
Deployment config:
* Resources: 500m CPU request, 2Gi limit
* NODE_OPTIONS=--max-old-space-size=1536
* Using emptyDir for .next/cache (20Gi limit)
* Production build with output: 'standalone'
What I've tried:
* Adjusting probe timeouts (no effect)
* Lowering/raising memory limits
* Scaling to 1 pod vs multiple pods (same behavior)
This is a production app that's currently unusable. The app runs perfectly fine locally in development and when I build it locally with next build && next start, so I have no way to reproduce this behavior outside of the k3s environment. I'm stuck debugging in production, which is not ideal.
Any insights would be greatly appreciated. I can provide additional logs, configs, or metrics if needed.
23 Replies
Pacific herringOP
I keep getting error logs like this:
⨯ Error: {"message":"TypeError: fetch failed","details":"TypeError: fetch failed\n\nCaused by: AggregateError: (ETIMEDOUT)\nAggregateError: \n at internalConnectMultiple (node:net:1122:18)\n at internalConnectMultiple (node:net:1190:5)\n at Timeout.internalConnectMultipleTimeout (node:net:1716:5)\n at listOnTimeout (node:internal/timers:583:11)\n at process.processTimers (node:internal/timers:519:7)","hint":"","code":""}
at ignore-listed frames {
digest: '3713074019'
}
⨯ Error: {"message":"TypeError: fetch failed","details":"TypeError: fetch failed\n\nCaused by: AggregateError: (ETIMEDOUT)\nAggregateError: \n at internalConnectMultiple (node:net:1122:18)\n at internalConnectMultiple (node:net:1190:5)\n at Timeout.internalConnectMultipleTimeout (node:net:1716:5)\n at listOnTimeout (node:internal/timers:583:11)\n at process.processTimers (node:internal/timers:519:7)","hint":"","code":""}
at ignore-listed frames {
digest: '3713074019'
}
⨯ Error: {"message":"TypeError: fetch failed","details":"TypeError: fetch failed\n\nCaused by: AggregateError: (ETIMEDOUT)\nAggregateError: \n at internalConnectMultiple (node:net:1122:18)\n at internalConnectMultiple (node:net:1190:5)\n at Timeout.internalConnectMultipleTimeout (node:net:1716:5)\n at listOnTimeout (node:internal/timers:583:11)\n at process.processTimers (node:internal/timers:519:7)","hint":"","code":""}
at ignore-listed frames {
digest: '3713074019'
}
Saint Hubert Jura Hound
Try requesting 2 cores, dont set any mem or cpu limits, also remove the liveness probe for now, lemme know what happens
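A rough sketch of what that suggestion could look like in the Deployment's container spec (names and probe path taken from the manifest posted later in this thread; an illustration, not a drop-in config):
      containers:
        - name: web-app
          image: myimage
          ports:
            - containerPort: 3000
          # keep the readiness probe so traffic only reaches ready pods,
          # but drop the livenessProbe entirely so the kubelet never kills the pod
          readinessProbe:
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 15
            timeoutSeconds: 10
          resources:
            requests:
              cpu: "2"        # request the two cores suggested above
              memory: "2Gi"
            # no limits block: the container can burst instead of being CPU-throttled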
@Saint Hubert Jura Hound Try requesting 2 cores, dont set any mem or cpu limits, also remove the liveness probe for now, lemme know what happens
Pacific herringOP
So I've disabled every probe and the resources:
# readinessProbe:
# httpGet:
# path: /api/health
# port: 3000
# initialDelaySeconds: 30
# periodSeconds: 15
# timeoutSeconds: 10
# failureThreshold: 3
# livenessProbe:
# httpGet:
# path: /api/health
# port: 3000
# initialDelaySeconds: 120
# periodSeconds: 30
# timeoutSeconds: 15
# failureThreshold: 3
# resources:
# requests:
# cpu: "1"
# memory: "2Gi"
# limits:
# cpu: "2"
# memory: "4Gi"
Now I still have errors like:
TypeError: controller[kState].transformAlgorithm is not a function
at ignore-listed frames
⨯ TypeError: fetch failed
at ignore-listed frames {
[cause]: AggregateError:
at ignore-listed frames {
code: 'ETIMEDOUT'
}
}
⨯ TypeError: fetch failed
at ignore-listed frames {
[cause]: AggregateError:
at ignore-listed frames {
code: 'ETIMEDOUT'
}
}
⨯ TypeError: fetch failed
at ignore-listed frames {
[cause]: AggregateError:
at ignore-listed frames {
code: 'ETIMEDOUT'
}
}
⨯ TypeError: fetch failed
at ignore-listed frames {
[cause]: AggregateError:
at ignore-listed frames {
code: 'ETIMEDOUT'
}
}
I've no idea what is going wrong, honestly.
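Those repeated ETIMEDOUT fetch failures usually mean the server can't reach whatever upstream it fetches during rendering, which is worth checking independently of the CPU problem. A quick way to test egress and DNS from inside a running pod (the hostname below is a placeholder for whatever the app actually calls):
# busybox wget ships in the alpine-based image
kubectl exec -it deploy/web-app -- wget -S -O /dev/null https://api.example.com/health
# and check DNS resolution from inside the cluster
kubectl exec -it deploy/web-app -- nslookup api.example.com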
And I have:
kubectl top pod -l app=web-app
NAME                       CPU(cores)   MEMORY(bytes)
web-app-7855c6b95c-ggbh9   1079m        634Mi
web-app-7855c6b95c-pg75l   1076m        529Mi
But the website is still loading...
Saint Hubert Jura Hound
its loading? 🤔 like the pages are rendering in the browser n stuff, ur getting 200's?
can u show ur docker image
i didnt look at the error that well earlier tbh but
fetch failed is weird. seems more like a networking issue rather than something else. but idk why that would cause cpu to spike
@Saint Hubert Jura Hound its loading? 🤔 like the pages are rendering in the browser n stuff, ur getting 200's?
Pacific herringOP
Wait, I've seen something. Without limits my pods peak at this:
kubectl top pod -l app=web-app
NAME                       CPU(cores)   MEMORY(bytes)
web-app-7855c6b95c-ggbh9   1052m        991Mi
web-app-7855c6b95c-pg75l   1155m        606Mi
Both pretty high. But after a few minutes, both of them go back down:
NAME                       CPU(cores)   MEMORY(bytes)
web-app-7855c6b95c-ggbh9   384m         427Mi
web-app-7855c6b95c-pg75l   516m         419Mi
And now it's working pretty well and fast (not the best, in my opinion).
Can it be related to the emptyDir in k3s for the cache?
Here is my deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      imagePullSecrets:
        - name: github-credentials
      containers:
        - name: web-app
          image: myimage
          imagePullPolicy: Always
          ports:
            - containerPort: 3000
          readinessProbe:
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 15
            timeoutSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 120
            periodSeconds: 30
            timeoutSeconds: 15
            failureThreshold: 3
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
          env:
            - name: NODE_OPTIONS
              value: "--max-old-space-size=2048 --dns-result-order=ipv4first"
          envFrom:
            - secretRef:
                name: web-app-secret
          volumeMounts:
            - name: cache
              mountPath: /app/.next/cache
      volumes:
        - name: cache
          emptyDir:
            sizeLimit: "20Gi"
So I guess my pods do some work during startup (idk what), so maybe that's why they were crash looping with the resource limits and probes in place.
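If the pods really do chew CPU for a while right after boot, one standard Kubernetes option for that window is a startupProbe: the liveness and readiness probes are disabled until it succeeds, so a slow warm-up can't trigger restarts. A rough sketch reusing the /api/health endpoint from the manifest above (the numbers are guesses, not tested values):
          startupProbe:
            httpGet:
              path: /api/health
              port: 3000
            periodSeconds: 10
            failureThreshold: 30   # tolerates up to ~5 minutes of startup before the kubelet gives up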
@Saint Hubert Jura Hound wait so why are u mounting .next/cache in a volume?
Pacific herringOP
Idk ahah, how am I supposed to do it then? How can I protect against the cache flooding my VPS (k3s node) and leaving it with no storage left?
@Pacific herring Idk ahah, how am I supposed to do it then? How can I protect against the cache flooding my VPS (k3s node) and leaving it with no storage left?
Saint Hubert Jura Hound
well in a standalone build theres no server cache unless u enable a cachehandler im pretty sure. that folder is used for image optimization cache and fetch cache (based on what it says here: https://github.com/vercel/next.js/discussions/74683)
@Saint Hubert Jura Hound well in a standalone build theres no server cache unless u enable a cachehandler im pretty sure. that folder is used for image optimization cache and fetch cache (based on what it says here: https://github.com/vercel/next.js/discussions/74683)
Pacific herringOP
Mhh, not so sure, because in some projects I can definitely see the cache folder holding some data (not image related) in it
Saint Hubert Jura Hound
yea fetch cache?
maybe?
either way theres no need to mount that as a volume
@Saint Hubert Jura Hound yea fetch cache?
Pacific herringOP
yes fetch cache
@Pacific herring Idk ahah, how am I supposed to do it then? How can I protect against the cache flooding my VPS (k3s node) and leaving it with no storage left?
Saint Hubert Jura Hound
it would make no difference in terms of cache floods anyway. thats something u need to handle explicitly where necessary
i have a feeling it wont make a difference but try removing the volume. if that doesnt work can u show ur dockerfile?
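For reference, "removing the volume" just means deleting the cache mount from the deployment above, roughly:
          # delete this from the container spec:
          volumeMounts:
            - name: cache
              mountPath: /app/.next/cache
      # and delete the matching pod-level block:
      volumes:
        - name: cache
          emptyDir:
            sizeLimit: "20Gi"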
@Saint Hubert Jura Hound i have a feeling it wont make a difference but try removing the volume. if that doesnt work can u show ur dockerfile?
Pacific herringOP
Here is my Dockerfile:
FROM node:20-alpine AS base
FROM base AS deps
ARG NPM_TOKEN
ENV NPM_TOKEN=${NPM_TOKEN}
RUN apk add --no-cache libc6-compat
WORKDIR /app
COPY package.json yarn.lock* package-lock.json* pnpm-lock.yaml* .npmrc* ./
RUN \
if [ -f yarn.lock ]; then yarn --frozen-lockfile; \
elif [ -f package-lock.json ]; then npm ci; \
elif [ -f pnpm-lock.yaml ]; then corepack enable pnpm && pnpm i --frozen-lockfile; \
else echo "Lockfile not found." && exit 1; \
fi
FROM base AS builder
# LOADING HERE SOME ENV VAR BUT NOT DISCORD PREMIUM SO.... LIKE THIS
# ARG NEXT_PUBLIC_SITE_URL
# ENV NEXT_PUBLIC_SITE_URL=${NEXT_PUBLIC_SITE_URL}
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN \
if [ -f yarn.lock ]; then yarn run build; \
elif [ -f package-lock.json ]; then npm run build; \
elif [ -f pnpm-lock.yaml ]; then corepack enable pnpm && pnpm run build; \
else echo "Lockfile not found." && exit 1; \
fi
FROM base AS runner
WORKDIR /app
ENV NODE_ENV=production
RUN addgroup --system --gid 1001 nodejs
RUN adduser --system --uid 1001 nextjs
COPY --from=builder /app/public ./public
COPY --from=builder --chown=nextjs:nodejs /app/.next/standalone ./
COPY --from=builder --chown=nextjs:nodejs /app/.next/static ./.next/static
USER nextjs
EXPOSE 3000
ENV PORT=3000
ENV HOSTNAME="0.0.0.0"
CMD ["node", "server.js"]@Pacific herring There is my dockerfile :
Saint Hubert Jura Hound
try switching off of alpine. and also off of node 20. its in maintenance until april this year anyway. best to upgrade to a new node version soon
but that will most likely fix ur issue
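A minimal version of that change in the Dockerfile above is just swapping the base image to a current LTS, Debian-based tag (the exact tag is a suggestion, not a requirement):
# was: FROM node:20-alpine AS base
FROM node:22-bookworm-slim AS base
# note: the "RUN apk add --no-cache libc6-compat" line in the deps stage is Alpine-specific
# and has to be removed once the image is Debian-based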