Next.js 16 consuming 1+ CPU core per pod at idle on k3s - constant crash loops

Pacific herring posted this in #help-forum
Pacific herringOP
I'm running Next.js 16.0.10 in production on a k3s cluster and experiencing severe performance issues that I didn't have before migrating to Kubernetes.

The problem:

* Each pod consumes ~1100m CPU (1+ core) constantly, even with zero traffic
* This causes readiness/liveness probes to timeout → pod restarts
* 124+ restarts in 22 hours, creating an endless crash loop
* The app starts fine (Ready in 153ms) but immediately spins CPU to 100%

Current metrics (with 0 traffic):

NAME          CPU(cores)   MEMORY(bytes)
web-app-xxx   1098m        339Mi
web-app-yyy   1177m        280Mi

Inside the pod (top):

PID 1 next-server 29% CPU VSZ 11.1g

Deployment config:

* Resources: 500m CPU request, 2Gi limit
* NODE_OPTIONS=--max-old-space-size=1536
* Using emptyDir for .next/cache (20Gi limit)
* Production build with output: 'standalone'

What I've tried:

* Adjusting probe timeouts (no effect)
* Lowering/raising memory limits
* Scaling to 1 pod vs multiple pods (same behavior)

This is a production app that's currently unusable. The app runs perfectly fine in local development and when I build it locally with next build && next start, so I have no way to reproduce this behavior outside of the k3s environment. I'm stuck debugging in production, which is not ideal.

Any insights would be greatly appreciated. I can provide additional logs, configs, or metrics if needed.

23 Replies

Pacific herringOP
I keep getting error logs like this (the same error, over and over):
 ⨯ Error: {"message":"TypeError: fetch failed","details":"TypeError: fetch failed\n\nCaused by: AggregateError:  (ETIMEDOUT)\nAggregateError: \n    at internalConnectMultiple (node:net:1122:18)\n    at internalConnectMultiple (node:net:1190:5)\n    at Timeout.internalConnectMultipleTimeout (node:net:1716:5)\n    at listOnTimeout (node:internal/timers:583:11)\n    at process.processTimers (node:internal/timers:519:7)","hint":"","code":""}
    at ignore-listed frames {
  digest: '3713074019'
}
Saint Hubert Jura Hound
Try requesting 2 cores, dont set any mem or cpu limits, also remove the liveness probe for now, lemme know what happens
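roughly like this in ur manifest (just a sketch, tune the numbers to ur node):
resources:
  requests:
    cpu: "2"        # headroom for the startup spike
  # no cpu or memory limits at all, so the kubelet can't throttle the process
# livenessProbe: commented out entirely while debugging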
Pacific herringOP
So I've disabled all the probes and resources:
# readinessProbe:
#   httpGet:
#     path: /api/health
#     port: 3000
#   initialDelaySeconds: 30
#   periodSeconds: 15
#   timeoutSeconds: 10
#   failureThreshold: 3
# livenessProbe:
#   httpGet:
#     path: /api/health
#     port: 3000
#   initialDelaySeconds: 120
#   periodSeconds: 30
#   timeoutSeconds: 15
#   failureThreshold: 3
# resources:
#   requests:
#     cpu: "1"
#     memory: "2Gi"
#   limits:
#     cpu: "2"
#     memory: "4Gi"

Now I still have errors like:
TypeError: controller[kState].transformAlgorithm is not a function
    at ignore-listed frames
 ⨯ TypeError: fetch failed
    at ignore-listed frames {
  [cause]: AggregateError:
      at ignore-listed frames {
    code: 'ETIMEDOUT'
  }
}
(the same fetch failed error repeats several more times)

I've no idea what's going wrong, honestly.
And I have:
kubectl top pod -l app=web-app
NAME                       CPU(cores)   MEMORY(bytes)   
web-app-7855c6b95c-ggbh9   1079m        634Mi           
web-app-7855c6b95c-pg75l   1076m        529Mi  

But the website is still loading...
Saint Hubert Jura Hound
can u show ur docker image
i didnt look at the error that well earlier tbh but fetch failed is weird. seems more like a networking issue rather than something else. but idk why that would cause cpu to spike
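if it is networking, u could at least give ur server-side fetches an explicit timeout so a dead upstream fails fast instead of hanging around. something like this (sketch, untested, the url is made up, AbortSignal.timeout needs node 17.3+):
// hypothetical helper: abort a hanging upstream call after `ms` milliseconds
async function fetchWithTimeout(url, ms = 5000) {
  return fetch(url, { signal: AbortSignal.timeout(ms) });
}
// usage in a route handler / server component:
// const res = await fetchWithTimeout('https://api.example.com/data');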
its loading? 🤔 like the pages are rendering in the browser n stuff, ur getting 200s?
Pacific herringOP
Wait, I've seen something. Without limits, my pods peak at this:
kubectl top pod -l app=web-app
NAME                       CPU(cores)   MEMORY(bytes)   
web-app-7855c6b95c-ggbh9   1052m        991Mi           
web-app-7855c6b95c-pg75l   1155m        606Mi 

Both pretty high. But after a few minutes, both of them come down:
NAME                       CPU(cores)   MEMORY(bytes)   
web-app-7855c6b95c-ggbh9   384m         427Mi           
web-app-7855c6b95c-pg75l   516m         419Mi 

And now it's working pretty well and fast (not the best, in my opinion).
Can it be related to the emptyDir in k3s for the cache?
Here is my deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      imagePullSecrets:
        - name: github-credentials
      containers:
        - name: web-app
          image: myimage
          imagePullPolicy: Always
          ports:
            - containerPort: 3000
          readinessProbe:
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 15
            timeoutSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 120
            periodSeconds: 30
            timeoutSeconds: 15
            failureThreshold: 3
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
          env:
            - name: NODE_OPTIONS
              value: "--max-old-space-size=2048 --dns-result-order=ipv4first"
          envFrom:
            - secretRef:
                name: web-app-secret
          volumeMounts:
            - name: cache
              mountPath: /app/.next/cache

      volumes:
        - name: cache
          emptyDir:
            sizeLimit: "20Gi"
So I guess my pods do some work during startup (idk what), so maybe that's why they were crash-looping when the resource limits and probes were on.
Saint Hubert Jura Hound
wait so why are u mounting .next/cache in a volume?
Pacific herringOP
Idk ahah, how am I supposed to do it? How can I protect against the cache being flooded and leaving my VPS (k3s node) with no storage left?
Saint Hubert Jura Hound
well in a standalone build theres no server cache unless u enable a cachehandler im pretty sure. that folder is used for image optimization cache and fetch cache (based on what it says here: https://github.com/vercel/next.js/discussions/74683)
Saint Hubert Jura Hound
yea fetch cache?
maybe?
either way theres no need to mount that as a volume
Pacific herringOP
yes fetch cache
Saint Hubert Jura Hound
it would make no difference in terms of cache floods anyway. thats something u need to handle explicitly where necessary
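if u ever do need to cap or move it, next.config.js has knobs for the data cache. rough sketch (untested, the handler path is made up):
// next.config.js
module.exports = {
  output: 'standalone',
  // cap the default in-memory cache; 0 disables it entirely
  cacheMaxMemorySize: 0,
  // optional: plug in ur own handler (redis etc) instead of the filesystem
  cacheHandler: require.resolve('./cache-handler.js'),
};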
i have a feeling it wont make a difference but try removing the volume. if that doesnt work can u show ur dockerfile?
Pacific herringOP
Here is my Dockerfile:
FROM node:20-alpine AS base
FROM base AS deps
ARG NPM_TOKEN
ENV NPM_TOKEN=${NPM_TOKEN}
RUN apk add --no-cache libc6-compat
WORKDIR /app
COPY package.json yarn.lock* package-lock.json* pnpm-lock.yaml* .npmrc* ./
RUN \
  if [ -f yarn.lock ]; then yarn --frozen-lockfile; \
  elif [ -f package-lock.json ]; then npm ci; \
  elif [ -f pnpm-lock.yaml ]; then corepack enable pnpm && pnpm i --frozen-lockfile; \
  else echo "Lockfile not found." && exit 1; \
  fi
FROM base AS builder
# LOADING HERE SOME ENV VAR BUT NOT DISCORD PREMIUM SO.... LIKE THIS
# ARG NEXT_PUBLIC_SITE_URL
# ENV NEXT_PUBLIC_SITE_URL=${NEXT_PUBLIC_SITE_URL}

WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .

RUN \
  if [ -f yarn.lock ]; then yarn run build; \
  elif [ -f package-lock.json ]; then npm run build; \
  elif [ -f pnpm-lock.yaml ]; then corepack enable pnpm && pnpm run build; \
  else echo "Lockfile not found." && exit 1; \
  fi

FROM base AS runner
WORKDIR /app

ENV NODE_ENV=production

RUN addgroup --system --gid 1001 nodejs
RUN adduser --system --uid 1001 nextjs

COPY --from=builder /app/public ./public

COPY --from=builder --chown=nextjs:nodejs /app/.next/standalone ./
COPY --from=builder --chown=nextjs:nodejs /app/.next/static ./.next/static

USER nextjs

EXPOSE 3000

ENV PORT=3000

ENV HOSTNAME="0.0.0.0"
CMD ["node", "server.js"]
Saint Hubert Jura Hound
try switching off of alpine. and also off of node 20. its in maintenance until april this year anyway. best to upgrade to a new node version soon
but that will most likely fix ur issue
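the swap is basically just the base image line, something like (sketch, pick whatever LTS line is current when u build):
# debian-slim on a newer LTS instead of alpine
FROM node:22-slim AS base
and then drop the RUN apk add --no-cache libc6-compat line, since that's alpine-specific (apt-get if u actually need extra libs)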