Next.js Discord

Discord Forum

Vercel 408 Function Timeouts

Cuvier’s Dwarf Caiman posted this in #help-forum
Cuvier’s Dwarf CaimanOP
When I upgrade from the latest 15.x to 16.x my service begins to spew 408 Request Timeout errors. Rolling back to 15.x fixes the issue. I tried various versions of 16.x, including the latest with the same result.

- I use fluid compute
- I have a PG pool with attachDatabasePool configured
- there is no rhyme or reason to the pages or routes this affects; it’s all of them.
- my application only uses app router

My max duration is set to 5 minutes.
Middleware resolves in milliseconds.
Response finished in 100s of milliseconds.
Function timeout after 5 minutes.

What would cause this to happen? How can I fix it? I’ve posted for assistance on Vercel official support forum before and received no help. I am out of ideas, and cannot reproduce the issue in a preview environment no matter what I’ve tried, and I’ve tried hundreds of things.

Any advice would be appreciated, thank you kindly.

61 Replies

@Cuvier’s Dwarf Caiman When I upgrade from the latest 15.x to 16.x my service begins to spew 408 Request Timeout errors. Rolling back to 15.x fixes the issue. […]
That response finishing in 100s of ms but the function timing out after 5 min means something is keeping the function alive after the response is done. Usually an open connection or listener that’s not cleaning up. With fluid compute + a pg pool, my guess is Next 16 changed how it signals the function is done, so Vercel thinks it’s still running. You could check if it happens on routes that don’t touch the db at all; if those work fine, it’s definitely the pool. Or try explicitly calling pool.end(), or releasing connections in a finally block, and see if you have any event listeners or streams that aren’t being closed. Also, what pg client are you using? @vercel/postgres or something like node-postgres directly? The attachDatabasePool behavior might be different in 16.
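If it helps, the shape I mean is roughly this. Pool and PoolClient are stubbed stand-ins for node-postgres types so the sketch is self-contained, not your actual setup:

```typescript
// Sketch of the release-in-finally pattern. The Pool/PoolClient shapes mirror
// node-postgres but are stubbed here so this runs anywhere.
interface PoolClient {
  query(sql: string): Promise<unknown>;
  release(): void;
}
interface Pool {
  connect(): Promise<PoolClient>;
}

export async function withClient<T>(
  pool: Pool,
  fn: (client: PoolClient) => Promise<T>
): Promise<T> {
  const client = await pool.connect();
  try {
    return await fn(client);
  } finally {
    // Runs whether fn resolves or throws, so the connection always goes back
    // to the pool instead of keeping the function alive.
    client.release();
  }
}
```

Wrapping every db touch in a helper like this makes it easy to audit for leaked connections.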
Cuvier’s Dwarf CaimanOP
I have been thinking along the same lines.

Unfortunately this happens across every page and route, even pages that aren’t explicitly interacting with the database. But I can try to explicitly release and see what it does, I’ve tried it before.

I’ve also been unable to reproduce this issue in preview environments. It even happens in production during off hours when fewer users are online.
In any case you’re the first person who has offered an opinion on this, so I appreciate your advice.
For db connections I’m using Prisma v6.19.2
I also have a read replica pool configured. So in addition to releasing, I may temporarily remove the read replica pool entirely, and see if having a single db connection for everything helps.
As for testing what I am thinking of doing is:

- deploying the change to Vercel
- immediately promoting a pre 16.x release so it’s not customer facing
- testing against the 16.x internal Vercel app

If you have ideas for how to test against a production environment that don’t require that ridiculous dance I’m all ears 👂
@Cuvier’s Dwarf Caiman In any case you’re the first person who has offered an opinion on this, so I appreciate your advice.
I’m more than happy to help. I know how frustrating it can be when you are stuck on something and no one has answers; let’s hope we can figure this one out.
Prisma with read replicas adds complexity for sure. Removing the read replica pool temporarily is a good test. If the timeouts stop, you know it’s something with how Prisma handles multiple pools in 16. Could also check if Prisma has any open issues with Next 16 specifically. For testing without that promotion dance, could you create a separate Vercel project pointing to the same repo but on a different branch? That way you get a real prod environment with prod env vars, but it’s isolated. Or use deployment protection on the 16.x deploy so only you can access it while it’s live.
Cuvier’s Dwarf CaimanOP
I’ll try the deployment protection feature first, see how that goes, and get back to you 🫡
Cuvier’s Dwarf CaimanOP
The feature I needed to disable ended up being under Settings > Environments > Production

So far I am not seeing any errors. I did upgrade to the latest Prisma v6.x in this release, so it’s possible that’s the issue, but I’ll wait a bit since it takes at least 5-10m before I see the timeouts after a deploy iirc.

I’m hoping that this was just a Prisma bug this whole time.
Cuvier’s Dwarf CaimanOP
So far so good, I don’t see any errors. I do see a ton of release event logs from @vercel/functions though. I’m on v3.3.6 of that package.

Perhaps something to do with it logging Prisma releasing the db connections?
I won’t know for sure until I put this change live to the masses. I’ll probably do that later tonight (EST time zone) or tomorrow morning
@Cuvier’s Dwarf Caiman So far so good, I don’t see any errors. I do see a ton of release event logs from @vercel/functions though. I’m on v3.3.6 of that package. Perhaps something to do with it logging Prisma releasing the db connections?
Good to know. Yeah, those release logs are probably Prisma connections cleaning up properly now. Let me know how the prod deploy goes, good luck!
Cuvier’s Dwarf CaimanOP
Unfortunately after having it live for over 12 hours, it’s throwing 408s again.
It’s certainly less noisy than in previous releases I’ve done of 16.x, but they may just have to do with traffic today.
Cuvier’s Dwarf CaimanOP
Most finish their responses way before 5m, but a 1m response time is also pretty suspicious here
Yea I just confirmed that when this happens the app becomes unresponsive for some subset of site visits, which basically means it’s an outage
Cuvier’s Dwarf CaimanOP
Looking at database connections the only thing I can mention that’s notable is that the # of supervisor connections increased from 2 to 4 and then 6 this morning. Which isn’t that crazy. May just be devs connecting to investigate what’s going on, or other third party data connections.

Postgres, pgbouncer, and “client connections waiting” were all consistent otherwise.

Definitely just seems like a bug between Vercel and one of my libraries. Something must be keeping the functions alive. What’s worse is that it gets busy enough that people can’t visit the site anymore, not sure why that might be?
Cuvier’s Dwarf CaimanOP
These happen much less often, in 12h I only see 23 of these, but still interesting that they breach the 5m max
Cuvier’s Dwarf CaimanOP
Next I'll probably try to omit the read replica and second attachDatabasePool call and see whether that helps here
Unfortunate that I can't reproduce this issue though, even in a non-customer facing app
@Cuvier’s Dwarf Caiman Unfortunate that I can't reproduce this issue though, even in a non-customer facing app
The 10m execution with no outgoing requests means something internal is blocking, so your plan to drop the read replica and second attachDatabasePool call is the right move. Under load, both pools could be competing for connections and requests just queue up until timeout. Since you are using Prisma, make sure the serverless connection handling is set up right, since it loves leaking connections without Accelerate or an external pooler. Let me know how the test goes.
Cuvier’s Dwarf CaimanOP
Yea, that is definitely going to be my next trial here
I am gonna start by removing the read replica and additional pool, and see whether that helps.
Cuvier’s Dwarf CaimanOP
Still timing out without the replica pool
Cuvier’s Dwarf CaimanOP
I guess the only variables here I have left to play with are the connection string or pg pool configuration, but I really don’t see why those would need to change just for me to release next 16.x.

I don’t have any other changes in this release, it’s quite simply an upgrade to next 16 and nothing else is changing.
@Cuvier’s Dwarf Caiman I guess the only variables here I have left to play with are the connection string or pg pool configuration, but I really don’t see why those would need to change just for me to release next 16.x. I don’t have any other changes in this release, it’s quite simply an upgrade to next 16 and nothing else is changing.
If the only change was the Next 16 upgrade, then it might be worth checking the Next.js GitHub issues to see if others are hitting this, and maybe roll back to 15 temporarily just to confirm that’s what’s causing it.
Cuvier’s Dwarf CaimanOP
I have rolled back to 15 half a dozen times now, it always fixes the problem. These are just my attempts to roll forward. I’ve been investigating this issue for 4-5 months now (on and off when I can)
I’ve read every relevant GitHub issue, vercel community post, stackoverflow, Reddit, discord, etc etc.
I’m getting closer though, I know it.
Cuvier’s Dwarf CaimanOP
I've dug deeper into this and have learned something critical.
The waitUntil method that we are trying to pull out of request context in the attachDatabasePool method is undefined.
https://github.com/vercel/vercel/blob/5003e8165447a74fa4b02e34c00b1ba6bb6bf4cf/packages/functions/src/db-connections/index.ts#L191C1-L196C4

So I always receive the log from `console.warn('Pool release event triggered outside of request scope.');`

Which means cleanup of the db pool is never happening at the correct time.
Internally waitUntil also just calls getContext here: https://github.com/vercel/vercel/blob/%40vercel/functions%403.3.6/packages/functions/src/wait-until.ts

Which would also do nothing, because I've confirmed that getContext() returns nothing in my next 16 app
@Cuvier’s Dwarf Caiman Internally waitUntil also just calls getContext here: https://github.com/vercel/vercel/blob/%40vercel/functions%403.3.6/packages/functions/src/wait-until.ts Which would also do nothing, because I've confirmed that getContext() returns nothing in my next 16 app
Oh nice, that’s a big find actually. If getContext() returns nothing then waitUntil silently no-ops and your pool connections just pile up forever, which would definitely explain the 408s over time. In Next.js 16 the move is to use the after() API from next/server instead; it handles post-response work without relying on the Vercel request context. You would import it and call after(() => pool.end()) or whatever your cleanup logic is directly in your route handler. That, or skip attachDatabasePool entirely and roll your own idle timeout, since that helper is basically broken without getContext. Have you seen this thread? https://github.com/vercel/next.js/discussions/50441 It might have already crossed your radar but there’s some relevant stuff in there about the after() workaround
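The shape is roughly this. I’ve stubbed after() here so the sketch is self-contained and runnable anywhere; in a real app you’d import it from next/server and the platform flushes the tasks after the response is sent. The pool object is a hypothetical stand-in for something like node-postgres’ Pool:

```typescript
// Stand-in for next/server's after(); stubbed so the sketch runs anywhere.
// Real code: import { after } from 'next/server';
type Task = () => void | Promise<void>;
const pendingTasks: Task[] = [];
const after = (task: Task): void => { pendingTasks.push(task); };

// The platform runs registered tasks once the response has been sent;
// this helper simulates that step for the sketch.
export async function flushAfterResponse(): Promise<void> {
  for (const task of pendingTasks.splice(0)) await task();
}

// Hypothetical pool with an end() method, like node-postgres' Pool.
export const pool = {
  ended: false,
  async end(): Promise<void> { this.ended = true; },
};

// Route-handler shape: respond first, clean up after.
export async function GET(): Promise<{ status: number }> {
  after(() => pool.end()); // cleanup is deferred past the response, not awaited in the handler
  return { status: 200 };
}
```

The point is that the handler itself never awaits the cleanup, so the response isn’t held open by it.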
Cuvier’s Dwarf CaimanOP
Appreciate you sharing this, I came across it but didn’t dig deep enough into it.

It’s funny you shared this though, as I’ve already begun writing my own custom pool maintenance based on after().
I’m hoping that will resolve it when I’m finished, but worried some other “gotcha” here will stop me from making it functional.
I’m a little surprised by all this tbh.
@Cuvier’s Dwarf Caiman I’m a little surprised by all this tbh.
Yeah, the after() situation is one of those things that looks simple on the surface, but the implementation details are tricky, especially around error handling and making sure the pool actually releases connections when the cleanup runs. The biggest gotcha I have seen people hit is that after() does not guarantee execution order if you have multiple callbacks registered, so if your pool maintenance depends on a specific teardown sequence you will want to handle that inside a single after() call rather than chaining them. The other thing to watch for is process shutdown: on serverless, the runtime can kill the process before after() finishes, so anything critical to data integrity should not live there. Pool cleanup is fine, since worst case you just get a stale connection on the next invocation, but worth knowing.
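To make the ordering point concrete, a minimal sketch, with step names made up for the example:

```typescript
// One teardown routine run inside a single after() call, so step order is
// explicit instead of depending on callback registration order.
type Step = { name: string; run: () => Promise<void> };

export async function teardown(steps: Step[]): Promise<string[]> {
  const completed: string[] = [];
  for (const step of steps) {
    try {
      await step.run(); // strictly sequential: each step waits for the previous one
      completed.push(step.name);
    } catch (err) {
      // A failed step should not block the rest of the cleanup.
      console.warn(`teardown step "${step.name}" failed`, err);
    }
  }
  return completed;
}

// In a route handler you would register it once, e.g. (names hypothetical):
//   after(() => teardown([
//     { name: 'drainReplica', run: () => replicaPool.end() },
//     { name: 'endPool',      run: () => pool.end() },
//   ]));
```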
Cuvier’s Dwarf CaimanOP
I have been live with Next 16 for about an hour. I did see a spike to about 5% timeouts at one point, but then it subsided. This release used my new pool management based on after(). While this is much better, I am still worried about these random timeout spikes. Wish I could get this to behave as consistently as 15 was behaving.
I am approaching peak usage this time of day for my product, and will probably see a small bump in an hour or so, so I am thinking that will be the real test of whether this is stable enough to keep in a live customer facing environment.
Cuvier’s Dwarf CaimanOP
I had an LLM study the hour of my Vercel logs (120MBs) after exporting them, and identify similar scenarios where requests lead to timeouts. In all cases the memory was fairly high. So the theory is for these cases that it has something to do with the way data is being streamed from a React Server Component, or perhaps some heavy data serialization. But not sure yet.

Looking at historical patterns in the observability tab I can see that memory with Next.js 16 is definitely higher by 200-300MB, but that's nowhere near the 2GB hard limit, or the 1.75GB soft limit that Vercel draws. So I would think it's not the problem, but somehow a symptom of an underlying issue here.
@Cuvier’s Dwarf Caiman I had an LLM study the hour of my Vercel logs (120MBs) after exporting them, and identify similar scenarios where requests lead to timeouts. In all cases the memory was fairly high. […]
The 200-300MB memory jump and timeouts correlating with high memory sounds like RSC serialization overhead. Even though you are not near the hard limit, Node GC pauses get worse with larger heaps. When multiple concurrent requests are each serializing large RSC payloads, the GC kicks in to clean up, and that pause is what shows up as a timeout spike. The pattern you are seeing, where it spikes then subsides, is classic GC pressure under concurrent load, not a memory leak.
If you can identify which routes are triggering the timeouts, try moving the heavy data fetching to the client side with something like React Query instead of passing it through RSC. That cuts the serialization overhead on the server. If you want to keep it in RSC, breaking the heavy components into separate Suspense boundaries so the payload streams in chunks instead of one large serialization could help too. Smaller chunks mean shorter GC pauses per request.
Cuvier’s Dwarf CaimanOP
Unfortunately I am not so sure that is the issue exactly even though it probably would help to identify areas where larger payloads are serialized. We don't really have many or any pages that serialize large responses tbh.
Take this 408 on the homepage for example: it had a fluid compute peak concurrency of 1, yet it timed out, had low memory usage, middleware resolved quickly, and the response says it took 60s, which is very odd.
The homepage of this site does not serialize any data, or make any async calls at all, so I can't see how that might help us here
In other cases I see the Peak Concurrency request count is much higher, and so is mem
So I imagine for those cases that reducing async db calls for RSCs would help some
Does fluid compute share all request types? Like are Route handlers, and open graph image endpoints, and RSCs, etc all shared amongst the same function invocations on Vercel?
I have a feeling that image I shared above had just not yet timed out or something, as I noticed it didn't even include a Function Invocation section, which is odd. So maybe that was a misdirection.
@Cuvier’s Dwarf Caiman I have a feeling that image I shared above had just not yet timed out or something, as I noticed it didn't even include a `Function Invocation` section, which is odd. So maybe that was a misdirection.
That screenshot is strange. Homepage with no data fetching, 312MB memory, 28ms middleware, hot start, single request, and it still times out at 60s. That is not your code. If the homepage makes zero async calls and still hits 60s, something on Vercel's side is hanging.
To your question, yes, all function types share the same pool on fluid compute. RSCs, route handlers, and OG image endpoints all get scheduled together. The missing Function Invocation section in that screenshot is suspicious though; it might mean the request never actually reached your function code. Could be stuck in their edge layer. Worth opening a support ticket with that specific request ID. This looks like a platform bug.
Cuvier’s Dwarf CaimanOP
I’ve managed to get it down to about 2% of traffic by enabling a few things, and tuning the after() code, but it could also just be site demand or traffic shape, not sure.

What I do know is that this recent post on use workflow is exactly the same as mine: https://github.com/vercel/workflow/issues/943

at least in the symptoms. But I don’t use useworkflow on this site at all.

It’s possible this is a Vercel bug, or something I’m just unaware of is somehow keeping the event loop busy sometimes. Like a third party library or the @vercel/otel package, or something else.
I’ll keep an eye out for problematic request IDs, and look into submitting a ticket. When I tried to submit a ticket yesterday via my phone it failed because the page doesn’t work on iOS 16 anymore, so I ended up losing the Request ID unfortunately 😔
I believe this is what they call a Comedy of Errors
Cuvier’s Dwarf CaimanOP
A few questions for you if you don’t mind:

1. does Vercel offer a way to measure the RSC finished streaming response size? Would be helpful in identifying if some random large payload was hiding and I wasn’t aware of it. Even if a small number of requests was sampled as a trace in otel.
2. Does Vercel have a way to profile a running node process and determine what is keeping the event loop busy? (I do this in datadog apm on my backend for about 5 years now, helpful for solving wall time and heap issues)
3. Does Vercel have the ability to temporarily enable Observability Plus on my account just so I can inspect some of these finer details in case it is actually a platform issue? I can’t afford OP right now, as my company is just 3 people, but want to figure out what’s going wrong here.

Thank you for reading 🙏 and helping me along the way here
@Cuvier’s Dwarf Caiman A few questions for you if you don’t mind: 1. does Vercel offer a way to measure the RSC finished streaming response size? […]
Nice that you got it down to 2%. That workflow issue is probably unrelated though, since those are functions that finish but hang, and yours are requests that never reach the function at all, based on the missing Function Invocation logs. For the RSC payload size, there is no built-in metric in the Vercel dashboard for that. You can filter for ?_rsc= in the devtools network tab to see transfer sizes per request. For production visibility you would need custom OTEL instrumentation to measure it, which is a bit of work to set up.

For profiling the event loop, Vercel does not have anything like Datadog APM continuous profiling. The functions are ephemeral, so there is nothing to attach to. You can use monitorEventLoopDelay from perf_hooks to at least detect if the loop is lagging, but it will not tell you what is causing it. Since you already run Datadog on your backend, reproducing the issue on a long-running node process with dd-trace attached might be the fastest path to real profiling data.
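A minimal sketch of the detection idea, using only Node's perf_hooks; the sample window is arbitrary:

```typescript
// Node stdlib sketch: detects THAT the event loop is lagging, not why.
import { monitorEventLoopDelay } from 'node:perf_hooks';

export function sampleLoopDelay(ms: number): Promise<number> {
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  histogram.enable();
  return new Promise((resolve) => {
    setTimeout(() => {
      histogram.disable();
      // Histogram values are nanoseconds; report the p99 in milliseconds.
      resolve(histogram.percentile(99) / 1e6);
    }, ms);
  });
}

// Usage idea: log it per invocation and correlate spikes with the 408s.
// sampleLoopDelay(1000).then((p99) => console.log('loop p99 ms:', p99));
```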

For Observability Plus, I do not think there is a free trial, but it is $10/month on Pro and prorated. So you could enable it, debug for a few days, and disable it for under $10. It might be worth checking with their support in case they can do something for your situation. One thing to keep in mind: if the 408s have no Function Invocation entry, none of this instrumentation would catch them anyway, because your code never runs on those requests.
Cuvier’s Dwarf CaimanOP
I've gone ahead and finally submitted a case on Vercel with some request IDs, as I was able to find (many) reproductions of the issue where no function invocation was displayed whatsoever in the logs for the homepage and other landing pages.
Cuvier’s Dwarf CaimanOP
Just wanted to follow up here that I have identified the actual problem.
It ended up being a needle-in-the-haystack style problem, so finding it was hard.

Next.js 15 -> 16 made all serverless fetches uncached by default.
A developer years ago had created a tiny fetch that requested a 1-pixel image on a CDN, as a really crappy health check.
This request was occasionally timing out, but was very rarely reported in our Vercel logs (I still don't know why; maybe because it hung for so long. Very strange that it affects other requests)

When it would time out, even pages that were not related to where the request was made would also time out (idk why this would happen; seems like a bug on Vercel to me)

Anyways, I replaced this simple request with:
- GET -> HEAD request
- Cache layer for both Success and Failures
- React.cache to avoid multiple lookups in one invocation
- A fallback image for when it fails
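Roughly this shape, with the names and the TTL made up for the sketch rather than copied from our code (and the React.cache wrapping omitted so it runs standalone):

```typescript
// Sketch of the replacement check: HEAD instead of GET, a short timeout, and a
// cache entry for both success AND failure, so a flaky CDN can't stall every
// request until the function times out.
type CacheEntry = { ok: boolean; expires: number };
let cached: CacheEntry | null = null;
const TTL_MS = 60_000; // hypothetical TTL

type Fetcher = (
  url: string,
  init: { method: string; signal: AbortSignal }
) => Promise<{ ok: boolean }>;

export async function cdnHealthy(url: string, fetcher: Fetcher = fetch): Promise<boolean> {
  const now = Date.now();
  if (cached && cached.expires > now) return cached.ok; // cached success or failure alike
  let ok = false;
  try {
    const res = await fetcher(url, {
      method: 'HEAD', // no body transfer needed for a health check
      signal: AbortSignal.timeout(2_000), // never hang longer than 2s
    });
    ok = res.ok;
  } catch {
    ok = false; // timeout or network error: remember it and serve the fallback image
  }
  cached = { ok, expires: now + TTL_MS };
  return ok;
}
```

Caching the failures turned out to matter as much as caching the successes, since the original code retried the slow CDN on every invocation.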

Almost immediately the timeouts stopped, and I have had 0% errors and timeouts for 15 hours now. I would think a codemod or even an AI skill prompt could be written to identify external fetches like this when migrating, and save a lot of people headache. I had multiple developers, Opus 4.6, GPT 5.3 Codex, and more try to identify this issue over the course of 4 or more months, and no one could find it.

Only reason I found it was because someone reported a different issue, on a completely unrelated part of the app, and while debugging that issue I found a single log amongst thousands that looked odd. It was different from all other Vercel timeout errors, and was an actual "fetch timeout" error. My telemetry told me enough that I knew it could only be one of a few requests, and that led me to the solution here. Nightmare over.