Next.js Discord

Discord Forum

Not able to scrape text (node fetch nor axios) with serverless function

Answered
Pond loach posted this in #help-forum
Avatar
Pond loachOP
When trying to make a request from the serverless function, i get this error, but it works fine on localhost
- info Loaded env from /var/task/.env
Received a POST request to api/context/link
Error: Runtime exited with error: signal: segmentation fault
Runtime.ExitError

lmk if you want to see the function or anything! Thank you!!
Image
Answered by Z4NR34L

42 Replies

Avatar
Pond loachOP
anyone have any idea 🙏
Avatar
Hi, could you provide us with this route's code? It would be hard to help without it.
Avatar
or a repository link if it's public 😄
Avatar
Pond loachOP
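// Imports this function relies on (not shown in the original message)
import axios from "axios";
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";
import * as cheerio from "cheerio";

async function scrapeLink(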
  link: string
): Promise<{ siteTitle: string; siteContent: string }> {
  try {
    console.log("scraping link:", link);
    const response = await axios.get(link, {
      headers: {
        "User-Agent":
          "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
      },
    });
    console.log("response:", response);
    const dom = new JSDOM(response.data, { url: link });
    let reader = new Readability(dom.window.document);
    let site = reader.parse();
    if (!site) {
      throw new Error("Unable to parse article");
    }
    const siteTitle = site.title;
    let oldSiteContent = site.content;
    const $ = cheerio.load(oldSiteContent);
    oldSiteContent = $("body").text();
    console.log("oldSiteContent:", oldSiteContent);
    const siteContent = oldSiteContent;

    return { siteTitle, siteContent };
  } catch (error) {
    console.error("Error scraping the link:", error);
    throw error; // Re-throw the error to be handled by the caller
  }
}
@Z4NR34L
you are a legend sir
Avatar
are you using edge runtime or node? 😄
Avatar
Pond loachOP
i don't have edge defined at the top of the route
im using export const dynamic = "force-dynamic";
Avatar
and I will need the whole route handler code
Avatar
Pond loachOP
ok
Image
Avatar
ok, I see a few things that possibly would not work in the Vercel environment. For such jobs I use plain fetch and node-html-parser; here is an example from one of my endpoints:

import {
  NextRequest,
  NextResponse
} from "next/server";
import { parse } from 'node-html-parser';

export async function POST(request: NextRequest) {
  const res = await fetch('https://example.com/smth');
  const bodyHtml = parse(await res.text())
  return NextResponse.json({
    productName: bodyHtml.querySelector('.c-product-info__name')?.text,
    offers: bodyHtml.querySelectorAll('.c-offers-list__cont:nth-child(2) > section').map((element) => ({
      // ?. before replace() avoids a TypeError when the aria-label attribute is missing
      store: element.querySelector('.c-offer__shop-logo-cont')?.attributes['aria-label']?.replace('Do obchodu ', ''),
      price: element.querySelector('.c-offer__price')?.text
    }))
  })
}
Answer
Avatar
And 2nd thing - POST requests are not cached anyway, so you don't need to use force-dynamic 😄
Avatar
Pond loachOP
ohhh ty
trying it rn
Avatar
I'm not using maxDuration as those functions are pretty fast 😄
and I'm not looping anything - TBH looping in serverless endpoints is pure pain 😄
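For reference, the runtime and maxDuration settings mentioned in this thread are route segment config exports in the Next.js App Router. A minimal sketch, assuming the handler lives in a file such as app/api/context/link/route.ts (the path is a guess based on the log line in the question) and that 30 seconds is an acceptable limit (an illustrative value; the allowed maximum depends on the Vercel plan):

```typescript
// Hypothetical route file: app/api/context/link/route.ts

// Node.js is the default runtime for route handlers; setting it explicitly
// only documents the intent. The alternative value is "edge".
export const runtime = "nodejs";

// Upper bound on execution time in seconds. 30 is an illustrative value,
// not a recommendation; fast handlers can omit this export entirely.
export const maxDuration = 30;
```

As noted above, a fast handler that does not loop probably needs neither export.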
Avatar
Pond loachOP
ok working on localhost
my git commit is stuck, so will test on vercel in a sec
Avatar
sure, let me know 😄
normal sites work
i forgot user agent !
ok i think i got it
one last test
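The forgotten User-Agent is easy to miss when moving from axios to plain fetch. A minimal sketch of one way to keep it in a single place, assuming Node 18+ (global fetch and AbortSignal.timeout). The names buildScrapeHeaders and fetchHtml, the Accept header, and the 8-second timeout are hypothetical; the User-Agent string is copied from the scrapeLink code earlier in the thread:

```typescript
// User-Agent copied from the original scrapeLink function in this thread.
const DEFAULT_USER_AGENT =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3";

// Hypothetical helper: returns request headers with a browser-like
// User-Agent. Entries in `extra` are spread last, so they override defaults.
function buildScrapeHeaders(
  extra: Record<string, string> = {}
): Record<string, string> {
  return {
    "User-Agent": DEFAULT_USER_AGENT,
    Accept: "text/html,application/xhtml+xml",
    ...extra,
  };
}

// Hypothetical helper: fetches a page and returns its HTML as a string.
// Throws on non-2xx responses; the request is aborted after `timeoutMs`.
async function fetchHtml(url: string, timeoutMs = 8000): Promise<string> {
  const res = await fetch(url, {
    headers: buildScrapeHeaders(),
    signal: AbortSignal.timeout(timeoutMs),
  });
  if (!res.ok) {
    throw new Error(`Request to ${url} failed with status ${res.status}`);
  }
  return res.text();
}
```

The returned HTML string can then be passed to parse() from node-html-parser, as in the answer above.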
Avatar
Remember that the websites you fetch may use various anti-bot solutions, and some of those sites cannot be scraped from Vercel/AWS/GCP/Azure infrastructure 😄
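When a site does block datacenter traffic, the function may receive an error status or a challenge page instead of the real content. A rough sketch of a check that flags such responses so they are not parsed as articles. The function name looksLikeBotBlock, the status list, and the body markers are all assumptions for illustration; real anti-bot products differ, and this heuristic will miss some blocks and may flag legitimate pages:

```typescript
// Status codes commonly returned when a request is refused or rate limited.
const BLOCK_STATUSES = [403, 429, 503];

// Phrases that sometimes appear on challenge pages. These are guesses,
// not an exhaustive or authoritative list.
const BLOCK_MARKERS = ["captcha", "access denied", "verify you are human"];

// Hypothetical heuristic: returns true if the response looks like a block
// page, judged by its status code or by marker phrases in the body.
function looksLikeBotBlock(status: number, body: string): boolean {
  if (BLOCK_STATUSES.includes(status)) {
    return true;
  }
  const lowerBody = body.toLowerCase();
  return BLOCK_MARKERS.some((marker) => lowerBody.includes(marker));
}
```

A route handler could call this with res.status and the response text, and return a clear error to the client instead of trying to parse a challenge page.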
Avatar
Pond loachOP
also btw do u work for vercel or something
ofc!
Avatar
I wish haha, I'm just a Software Engineer with many projects on Vercel
Avatar
Pond loachOP
nice
Avatar
I mostly work on Business Intelligence systems built on Next.js + Vercel
Avatar
Pond loachOP
ok it works
oh very cool
🫡
i also found you on twitter 🐣
Avatar
If you don't mind, I will leave a link to my blog here, where you can find more about Next or Vercel haha

https://www.zanreal.net/blog
Avatar
Pond loachOP
ok sweet i'll check it out
Avatar
Happy coding! 😄
Avatar
Pond loachOP
clean ass site damn
Avatar
there is still rebranding in progress :lolsob:
Avatar
Pond loachOP
looks very cool