Next.js Discord


Next.js 16 consuming 1+ CPU core per pod at idle on k3s - constant crash loops

Unanswered
Pacific herring posted this in #help-forum
Pacific herringOP
I'm running Next.js 16.0.10 in production on a k3s cluster and experiencing severe performance issues that I didn't have before migrating to Kubernetes.

The problem:

* Each pod consumes ~1100m CPU (1+ core) constantly, even with zero traffic
* This causes readiness/liveness probes to timeout → pod restarts
* 124+ restarts in 22 hours, creating an endless crash loop
* The app starts fine (Ready in 153ms) but immediately spins CPU to 100%

Current metrics (with 0 traffic):

NAME          CPU(cores)   MEMORY(bytes)
web-app-xxx   1098m        339Mi
web-app-yyy   1177m        280Mi

Inside the pod (top):

PID 1 next-server 29% CPU VSZ 11.1g

Deployment config:

* Resources: 500m CPU request, 2Gi limit
* NODE_OPTIONS=--max-old-space-size=1536
* Using emptyDir for .next/cache (20Gi limit)
* Production build with output: 'standalone'

What I've tried:

* Adjusting probe timeouts (no effect)
* Lowering/raising memory limits
* Scaling to 1 pod vs multiple pods (same behavior)

This is a production app that's currently unusable. The app runs perfectly fine locally in development and when I build it locally with next build && next start, so I have no way to reproduce this behavior outside of the k3s environment. I'm stuck debugging in production which is not ideal.

Any insights would be greatly appreciated. I can provide additional logs, configs, or metrics if needed.
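[Editor's note: the probes above hit /api/health. The actual handler isn't shown anywhere in this thread, but a minimal App Router sketch of such a route would look like this — the file path and body shape are assumptions, not the OP's code:]

```javascript
// Hypothetical app/api/health/route.js (in the real file this function would be
// exported as `export function GET()`). Keeping the handler free of database or
// upstream calls keeps probe responses cheap even when the server is busy,
// which matters when probes are timing out under CPU pressure.
function GET() {
  return Response.json({ status: "ok", uptime: process.uptime() });
}
```

A handler like this only proves the Node process is responsive; it won't tell you why the CPU is pegged.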

59 Replies

Pacific herringOP
I keep getting error logs like this:
 ⨯ Error: {"message":"TypeError: fetch failed","details":"TypeError: fetch failed\n\nCaused by: AggregateError:  (ETIMEDOUT)\nAggregateError: \n    at internalConnectMultiple (node:net:1122:18)\n    at internalConnectMultiple (node:net:1190:5)\n    at Timeout.internalConnectMultipleTimeout (node:net:1716:5)\n    at listOnTimeout (node:internal/timers:583:11)\n    at process.processTimers (node:internal/timers:519:7)","hint":"","code":""}
    at ignore-listed frames {
  digest: '3713074019'
}
(the same "fetch failed" / ETIMEDOUT error repeats twice more)
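[Editor's note: that ETIMEDOUT chain comes from Node's built-in fetch (undici) giving up on a TCP connect to some upstream. A small retry wrapper like the sketch below at least bounds how long a render blocks on a dead upstream — `fetchWithRetry` is a made-up helper for illustration, not a Next.js API:]

```javascript
// Hypothetical helper -- not part of Next.js. Wraps an async fetch-like call
// with a per-attempt timeout and a capped number of retries, so a dead upstream
// fails fast instead of tying up server renders. No backoff between attempts;
// real code would add one.
async function fetchWithRetry(doFetch, { attempts = 3, timeoutMs = 2000 } = {}) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      // Each attempt gets its own abort signal.
      return await doFetch(controller.signal);
    } catch (err) {
      lastError = err;
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError;
}
```

Call it as `fetchWithRetry((signal) => fetch(url, { signal }))` so the timeout actually aborts the underlying request.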
Saint Hubert Jura Hound
Try requesting 2 cores, dont set any mem or cpu limits, also remove the liveness probe for now, lemme know what happens
@Saint Hubert Jura Hound Try requesting 2 cores, dont set any mem or cpu limits, also remove the liveness probe for now, lemme know what happens
Pacific herringOP
So I've disabled every probe and the resources:
# readinessProbe:
#   httpGet:
#     path: /api/health
#     port: 3000
#   initialDelaySeconds: 30
#   periodSeconds: 15
#   timeoutSeconds: 10
#   failureThreshold: 3
# livenessProbe:
#   httpGet:
#     path: /api/health
#     port: 3000
#   initialDelaySeconds: 120
#   periodSeconds: 30
#   timeoutSeconds: 15
#   failureThreshold: 3
# resources:
#   requests:
#     cpu: "1"
#     memory: "2Gi"
#   limits:
#     cpu: "2"
#     memory: "4Gi"

Now I still have errors like:
TypeError: controller[kState].transformAlgorithm is not a function
    at ignore-listed frames
 ⨯ TypeError: fetch failed
    at ignore-listed frames {
  [cause]: AggregateError:
      at ignore-listed frames {
    code: 'ETIMEDOUT'
  }
}
(the same "fetch failed" / ETIMEDOUT error repeats three more times)

I've no idea what's going wrong honestly
And I have:
kubectl top pod -l app=web-app
NAME                       CPU(cores)   MEMORY(bytes)   
web-app-7855c6b95c-ggbh9   1079m        634Mi           
web-app-7855c6b95c-pg75l   1076m        529Mi  

But the website still loads...
Saint Hubert Jura Hound
can u show ur docker image
i didnt look at the error that well earlier tbh but fetch failed is weird. seems more like a networking issue rather than something else. but idk why that would cause cpu to spike
@Saint Hubert Jura Hound its loading? 🤔 like the pages are rendering in the browser n stuff, ur getting 200's?
Pacific herringOP
Wait, I've seen something. Without limits my pods hit this at the top:
kubectl top pod -l app=web-app
NAME                       CPU(cores)   MEMORY(bytes)   
web-app-7855c6b95c-ggbh9   1052m        991Mi           
web-app-7855c6b95c-pg75l   1155m        606Mi 

Both pretty high. But after a few minutes, both of them decrease:
NAME                       CPU(cores)   MEMORY(bytes)   
web-app-7855c6b95c-ggbh9   384m         427Mi           
web-app-7855c6b95c-pg75l   516m         419Mi 

And now it's working pretty well and fast (not the best, in my opinion)
Can it be related to the emptyDir in k3s for the cache ?
Here is my deployment.yaml :
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      imagePullSecrets:
        - name: github-credentials
      containers:
        - name: web-app
          image: myimage
          imagePullPolicy: Always
          ports:
            - containerPort: 3000
          readinessProbe:
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 15
            timeoutSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /api/health
              port: 3000
            initialDelaySeconds: 120
            periodSeconds: 30
            timeoutSeconds: 15
            failureThreshold: 3
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
          env:
            - name: NODE_OPTIONS
              value: "--max-old-space-size=2048 --dns-result-order=ipv4first"
          envFrom:
            - secretRef:
                name: web-app-secret
          volumeMounts:
            - name: cache
              mountPath: /app/.next/cache

      volumes:
        - name: cache
          emptyDir:
            sizeLimit: "20Gi"
So I guess during startup my pods do some work (idk what), so maybe that's why, with resource limits and probes, they were crash-looping
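[Editor's note: if the heavy work really is concentrated at startup, a startupProbe is the Kubernetes-native fix — it suppresses the liveness probe until the app has warmed up, instead of disabling probes entirely. A sketch under that assumption, with thresholds that would need tuning to the actual warm-up time:]

```yaml
# Sketch: gate liveness behind a startupProbe so the pod isn't restarted
# while it's doing its startup work. Values are guesses to tune.
startupProbe:
  httpGet:
    path: /api/health
    port: 3000
  periodSeconds: 10
  failureThreshold: 30   # tolerates up to ~5 minutes of startup
# The livenessProbe can stay as originally configured; it only begins
# running after the startupProbe succeeds.
```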
@Saint Hubert Jura Hound wait so why are u mounting .next/cache in a volume?
Pacific herringOP
Idk ahah, how am I supposed to do it ? How can I protect against the cache flooding and leaving my VPS (k3s node) with no storage left ?
@Pacific herring Idk ahah how Im supposed to do ? How can I protect the cache being flooded and make my. VPS (k3s node) no storage left ?
Saint Hubert Jura Hound
well in a standalone build theres no server cache unless u enable a cachehandler im pretty sure. that folder is used for image optimization cache and fetch cache (based on what it says here: https://github.com/vercel/next.js/discussions/74683)
Saint Hubert Jura Hound
yea fetch cache?
maybe?
either way theres no need to mount that as a volume
@Saint Hubert Jura Hound yea fetch cache?
Pacific herringOP
yes fetch cache
@Pacific herring Idk ahah how Im supposed to do ? How can I protect the cache being flooded and make my. VPS (k3s node) no storage left ?
Saint Hubert Jura Hound
it would make no difference in terms of cache floods anyway. thats something u need to handle explicitly where necessary
i have a feeling it wont make a difference but try removing the volume. if that doesnt work can u show ur dockerfile?
@Saint Hubert Jura Hound i have a feeling it wont make a difference but try removing the volume. if that doesnt work can u show ur dockerfile?
Pacific herringOP
Here is my Dockerfile:
FROM node:20-alpine AS base
FROM base AS deps
ARG NPM_TOKEN
ENV NPM_TOKEN=${NPM_TOKEN}
RUN apk add --no-cache libc6-compat
WORKDIR /app
COPY package.json yarn.lock* package-lock.json* pnpm-lock.yaml* .npmrc* ./
RUN \
  if [ -f yarn.lock ]; then yarn --frozen-lockfile; \
  elif [ -f package-lock.json ]; then npm ci; \
  elif [ -f pnpm-lock.yaml ]; then corepack enable pnpm && pnpm i --frozen-lockfile; \
  else echo "Lockfile not found." && exit 1; \
  fi
FROM base AS builder
# LOADING HERE SOME ENV VAR BUT NOT DISCORD PREMIUM SO.... LIKE THIS
# ARG NEXT_PUBLIC_SITE_URL
# ENV NEXT_PUBLIC_SITE_URL=${NEXT_PUBLIC_SITE_URL}

WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .

RUN \
  if [ -f yarn.lock ]; then yarn run build; \
  elif [ -f package-lock.json ]; then npm run build; \
  elif [ -f pnpm-lock.yaml ]; then corepack enable pnpm && pnpm run build; \
  else echo "Lockfile not found." && exit 1; \
  fi

FROM base AS runner
WORKDIR /app

ENV NODE_ENV=production

RUN addgroup --system --gid 1001 nodejs
RUN adduser --system --uid 1001 nextjs

COPY --from=builder /app/public ./public

COPY --from=builder --chown=nextjs:nodejs /app/.next/standalone ./
COPY --from=builder --chown=nextjs:nodejs /app/.next/static ./.next/static

USER nextjs

EXPOSE 3000

ENV PORT=3000

ENV HOSTNAME="0.0.0.0"
CMD ["node", "server.js"]
@Pacific herring There is my dockerfile : …
Saint Hubert Jura Hound
try switching off of alpine. and also off of node 20. its in maintenance until april this year anyway. best to upgrade to a new node version soon
but that will most likely fix ur issue
@Pacific herring Which node version should I use ? 18 or something more recent ?
Saint Hubert Jura Hound
Pacific herringOP
v20 seems in maintenance no ?
Saint Hubert Jura Hound
Yep thats (part of) why its the minimum node version
Pacific herringOP
So can I use this : FROM node:24-bookworm-slim AS base ?
Saint Hubert Jura Hound
That sounds like itll be worth a shot
LTS stands for long-term support. So that's usually the one to go with
Pacific herringOP
Okay thanks ! I'm gonna try this Dockerfile :
FROM node:24-bookworm-slim AS base

FROM base AS deps
ARG NPM_TOKEN
ENV NPM_TOKEN=${NPM_TOKEN}
WORKDIR /app
COPY package.json yarn.lock* package-lock.json* pnpm-lock.yaml* .npmrc* ./
RUN \
  if [ -f yarn.lock ]; then yarn --frozen-lockfile; \
  elif [ -f package-lock.json ]; then npm ci; \
  elif [ -f pnpm-lock.yaml ]; then corepack enable pnpm && pnpm i --frozen-lockfile; \
  else echo "Lockfile not found." && exit 1; \
  fi

FROM base AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN \
  if [ -f yarn.lock ]; then yarn run build; \
  elif [ -f package-lock.json ]; then npm run build; \
  elif [ -f pnpm-lock.yaml ]; then corepack enable pnpm && pnpm run build; \
  else echo "Lockfile not found." && exit 1; \
  fi

FROM base AS runner
WORKDIR /app
ENV NODE_ENV=production
RUN addgroup --system --gid 1001 nodejs
RUN adduser --system --uid 1001 nextjs
COPY --from=builder /app/public ./public
COPY --from=builder --chown=nextjs:nodejs /app/.next/standalone ./
COPY --from=builder --chown=nextjs:nodejs /app/.next/static ./.next/static
USER nextjs
EXPOSE 3000
ENV PORT=3000
ENV HOSTNAME="0.0.0.0"
CMD ["node", "server.js"]

My app works rn but it's still slow. I have to find the reason, because needing 2 pods with 1 vCPU and 2GB RAM each seems weird. Or maybe it's a common issue that nextjs is slow :lolsob:
Saint Hubert Jura Hound
are the fetch errors gone?
Saint Hubert Jura Hound
lol i cant believe this but i just tried setting requests and limits on my nextjs app since i hadnt done that yet, and somehow it also started getting incredibly slow
removing the requests/limits sped it up again
i didnt test whether its cpu or mem that did it
there was basically no difference in resource usage between including and removing the limits tho.
// with req/limits
NAME                        CPU(cores)   MEMORY(bytes)
frontend-674d64b566-28jmj   12m          186Mi
frontend-674d64b566-vx4td   21m          181Mi

// without
NAME                        CPU(cores)   MEMORY(bytes)
frontend-578b4497bf-4r2qh   7m           136Mi
frontend-578b4497bf-z6lqk   5m           126Mi
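[Editor's note: the slowdown with limits set is consistent with CFS throttling — a CPU limit hard-caps the container even when the node has idle cores. A common middle ground, sketched here as one option rather than a universal rule, is to keep requests for scheduling but drop only the CPU limit:]

```yaml
# Sketch: requests reserve capacity for the scheduler; omitting the CPU
# limit avoids CFS throttling. A memory limit is usually still worth
# keeping, since memory is not a compressible resource.
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    memory: "2Gi"   # no cpu limit -> no CPU throttling
```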
@Saint Hubert Jura Hound are the fetch errors gone?
Pacific herringOP
Well, one of my nodes completely crashed, making my whole stack crash too :lolsob:
@Saint Hubert Jura Hound there was basically no difference in resource usage between including and removing the limits tho. …
Pacific herringOP
wait how can you have 7m CPU and 136Mi of RAM... why am I getting something like this at idle:
kubectl top pod -l app=web-app
NAME                       CPU(cores)   MEMORY(bytes)   
web-app-7dbc844b7d-w4trq   341m         544Mi    
@Pacific herring Well one of my node completely crash make all my stack crash too <:lolsob:753870958489632819>
Saint Hubert Jura Hound
🤣🤣 same here but for a different reason lmaooo
@Saint Hubert Jura Hound Did you still have requests set here? Maybe try to remove them
Pacific herringOP
I have no req/limits and no probes... but I have:
web-app-7dbc844b7d-w4trq   283m         1079Mi   

But idk if, considering my traffic, it's more than it should be
Saint Hubert Jura Hound
Yea traffic definitely has an impact here. My pods were under no load. But i dont think it should make THIS big of a difference
Unlessss ur loading some stuff into memory that u shouldnt be
@Saint Hubert Jura Hound Unlessss ur loading some stuff into memory that u shouldnt be
Pacific herringOP
What do you mean by “loading some stuff into memory” ? Bc I’m making some fetches server side (and I cache a few of them with the Next cache)
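[Editor's note: on caching server-side fetches — Next.js's built-in fetch cache handles this, but the underlying idea can be shown with a plain in-memory memoizer. `cachedFetch` and its TTL are made up for illustration; this is not the Next.js cache implementation, and an unbounded in-process Map is itself one way pod memory quietly grows:]

```javascript
// Hypothetical in-memory TTL cache around an async loader. Illustrates the
// idea behind caching server-side fetches; Next.js's real fetch cache is
// persistent and more sophisticated. Note: nothing evicts stale entries
// here, so an unbounded key space would grow memory over time.
const cache = new Map();

async function cachedFetch(key, loader, ttlMs = 60_000) {
  const hit = cache.get(key);
  if (hit && Date.now() - hit.at < ttlMs) return hit.value;
  const value = await loader();
  cache.set(key, { value, at: Date.now() });
  return value;
}
```

Usage: `cachedFetch("users", () => fetch(url).then((r) => r.json()))` — repeated calls within the TTL hit the Map instead of the network.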
Saint Hubert Jura Hound
Ive seen that this is possible for example when using internationalization libraries, if theyre loaded incorrectly, and especially if theyre cached after, it could cause high memory or bundle size. Im just giving an example idk if this could actually be the case
But it might be worth looking into bc theres no reason for ur pods to use that much mem i dont think
Ill probably do a stress test later this week or next week with proper observability to get a better idea of resource usage on a next pod though bc i dont know whats considered normal usage
But my go pods are idling at 3 milicores and 20 mebibytes mem so lol
@Saint Hubert Jura Hound But my go pods are idling at 3 milicores and 20 mebibytes mem so lol
Pacific herringOP
ahah my pods are mining bitcoin maybe idk
@Pacific herring I use next-intl and it can be a problem in fact
Saint Hubert Jura Hound
yep that might be it then
@Pacific herring ahah my pods are mining bitcoin maybe idk
Saint Hubert Jura Hound
🤣 or that
@Saint Hubert Jura Hound yep that might be it then
Pacific herringOP
web-app-7dbc844b7d-nl8xc   645m         1859Mi          
web-app-7dbc844b7d-w4trq   224m         1957Mi 

everything fine
I have 53 active users, but it seems pretty high anyway. So if I had 10k users, do I have to buy a data center or what
Pacific herringOP
Locally, with Docker Compose:
services:
  web:
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      NODE_OPTIONS: "--inspect=0.0.0.0:9229"
    ports:
      - "3000:3000"
      - "9229:9229"
    restart: unless-stopped

it uses:
3e0295d29416   web-app-web-1   0.06%     179.9MiB / 7.653GiB   2.30%     510kB / 5.12MB   12.7MB / 24.6kB   12

still high no ?
Pacific herringOP
That's bad:
hey -z 60s -c 50 http://localhost:3000/

Summary:
  Total:        60.1100 secs
  Slowest:      2.9923 secs
  Fastest:      0.1099 secs
  Average:      1.3985 secs
  Requests/sec: 35.7345
  

Response time histogram:
  0.110 [1]     |
  0.398 [9]     |
  0.686 [12]    |
  0.975 [3]     |
  1.263 [8]     |
  1.551 [1993]  |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  1.839 [109]   |■■
  2.128 [1]     |
  2.416 [0]     |
  2.704 [2]     |
  2.992 [10]    |


Latency distribution:
  10% in 1.3335 secs
  25% in 1.3498 secs
  50% in 1.3726 secs
  75% in 1.4318 secs
  90% in 1.5152 secs
  95% in 1.5580 secs
  99% in 1.6954 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0001 secs, 0.1099 secs, 2.9923 secs
  DNS-lookup:   0.0001 secs, 0.0000 secs, 0.0024 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0005 secs
  resp wait:    0.2608 secs, 0.0307 secs, 1.4984 secs
  resp read:    1.1376 secs, 0.0488 secs, 1.5187 secs

Status code distribution:
  [200] 2148 responses
@Pacific herring I have 53 actives user but seems pretty high anyway. So if I had 10k users I have to buy a data center or what
Saint Hubert Jura Hound
hm i mean that's better. the cpu is still kinda high but the memory seems normal for a nodejs app. and the effect of caching will be greater the more users u have, yk. if there's only 50 then there will be a lot fewer cache hits than if there were 500