is there a way to have a tmp directory on the server?
Answered
Dutch Smoushond posted this in #help-forum
Dutch SmoushondOP
^
Answered by Ray
we could use `/tmp` on vercel for temporary storage
https://github.com/vercel/vercel/discussions/5320#discussioncomment-110775
75 Replies
Dutch SmoushondOP
[OfficeParser]: Error: ENOENT: no such file or directory, mkdir 'officeParserTemp/tempfiles'
Trying to use officeparser through langchain
Toyger
you already have tmp, it's Linux: the `/tmp` folder
so if you want to create something there, create it inside that folder, like
`/tmp/officeParserTemp/tempfiles`
but you should understand that "temporary" can mean only as long as the request is running: Vercel runs on ephemeral instances, so by the time the next request is invoked this folder may already have been deleted.
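A minimal sketch of the advice above, assuming a Node.js runtime: build the scratch path under `os.tmpdir()` (which is `/tmp` on Linux), create it with `recursive: true`, and remove it once the work is done so nothing relies on it surviving until the next invocation. The names `withScratchDir` and `roundTrip` are hypothetical helpers, not part of officeparser or LangChain.

```typescript
import { mkdir, readFile, rm, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: runs `work` with a scratch directory under the OS temp
// folder and deletes that directory afterwards, even if `work` throws.
export async function withScratchDir<T>(
  subPath: string,
  work: (dir: string) => Promise<T>,
): Promise<T> {
  const dir = join(tmpdir(), subPath);
  // `recursive: true` creates missing parent folders and does not fail
  // when the directory already exists.
  await mkdir(dir, { recursive: true });
  try {
    return await work(dir);
  } finally {
    // Clean up the scratch directory; `force` ignores "not found" errors.
    await rm(dir, { recursive: true, force: true });
  }
}

// Example use: write a file into the scratch directory and read it back.
export async function roundTrip(text: string): Promise<string> {
  return withScratchDir("officeParserTemp/tempfiles", async (dir) => {
    const file = join(dir, "note.txt");
    await writeFile(file, text, "utf8");
    return readFile(file, "utf8");
  });
}
```

Concurrent requests that share one fixed folder name could delete each other's files, so a per-request folder name may be safer in practice.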
Dutch SmoushondOP
Thank you for the response. So what do you think would work in this instance? I'm using the package https://js.langchain.com/docs/integrations/document_loaders/file_loaders/pptx which I believe is built on top of this package https://github.com/harshankur/officeParser#readme which states the need for the tmp folder
Toyger
it probably is using it, but they didn't expose a temp folder option. You can either use your own implementation that exposes it, or ask in the langchain issues whether they can expose it as an option to customize the temp folder location
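One way to read "your own implementation" is a small wrapper that takes the parser as an argument and passes a temp folder setting through to it. The sketch below is only an illustration under assumptions: `parseToDocuments`, `ParseFn`, and `SimpleDocument` are hypothetical names, and the `tempFilesLocation` option mirrors the officeparser config that appears later in this thread. Like the loader described in the LangChain source comment quoted in this thread, it returns an empty array for empty text and otherwise a single document.

```typescript
import { Buffer } from "node:buffer";

// Shape of the parser this wrapper depends on. The parser is injected so the
// temp folder stays configurable and the wrapper can be tested with a fake.
export type ParseFn = (
  raw: Buffer,
  config: { tempFilesLocation: string },
) => Promise<string>;

// Hypothetical stand-in for LangChain's Document class.
export interface SimpleDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

export async function parseToDocuments(
  parse: ParseFn,
  raw: Buffer,
  metadata: Record<string, unknown>,
  tempFilesLocation: string = "/tmp",
): Promise<SimpleDocument[]> {
  const text = await parse(raw, { tempFilesLocation });
  // Empty extraction result: return no documents.
  if (!text) return [];
  return [{ pageContent: text, metadata }];
}
```

With the real library this might be called as `parseToDocuments(officeParser.parseOfficeAsync, buffer, { source: file.name })`, assuming the installed officeparser version resolves to the extracted text as a string.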
American Crow
I just read over the docs (not carefully). Are you sure you need a temp folder? I don't see that part. Can you not simply read a file from public or whatever?
Dutch SmoushondOP
I could maybe just use officeparser directly
This is in the officeparser github
Is AWS S3 a solution? I've read that some have used that
Dutch SmoushondOP
Also, if you dig into the langchain node_modules, you'll find
/**
 * A method that takes a `raw` buffer and `metadata` as parameters and
 * returns a promise that resolves to an array of `Document` instances. It
 * uses the `parseOfficeAsync` function from the `officeparser` module to extract
 * the raw text content from the buffer. If the extracted powerpoint content is
 * empty, it returns an empty array. Otherwise, it creates a new
 * `Document` instance with the extracted powerpoint content and the provided
 * metadata, and returns it as an array.
 */
American Crow
you're right Duke, I found the issue:
https://github.com/langchain-ai/langchainjs/issues/4000
Dutch SmoushondOP
The PDF loader works fine though. Is it writing to tmp without a problem? I never knew that
Dutch SmoushondOP
Unfortunate that they still never addressed this
American Crow
yea sorry can't really help
Ray
could you show some code for how you are using it?
Dutch SmoushondOP
Yah, the file here is a file object that comes from the frontend, basically doing exactly what the docs do
import { PPTXLoader } from "langchain/document_loaders/fs/pptx";
const loader = new PPTXLoader(file);
const docs = await loader.load();
@Ray on page component?
Dutch SmoushondOP
in api folder
on the backend
Ray
add this to your next.config.js:
/** @type {import('next').NextConfig} */
const nextConfig = {
  experimental: {
    serverComponentsExternalPackages: ["officeparser"],
  },
};
module.exports = nextConfig;
after that, this code works for me:
import path from "path";
import { PPTXLoader } from "langchain/document_loaders/fs/pptx";
export async function GET() {
  const loader = new PPTXLoader(path.join(process.cwd(), "test.docx"));
  const docs = await loader.load();
  return Response.json(docs);
}
a blob works too:
import fs from "fs/promises";
const filePath = path.join(process.cwd(), "test.docx");
const buffer = await fs.readFile(filePath);
const loader = new PPTXLoader(new Blob([buffer]));
const docs = await loader.load();
Dutch SmoushondOP
async function readPPT(file) {
  const filePath = path.join(process.cwd(), "test.docx")
  const buffer = fs.readFile(filePath)
  const loader = new PPTXLoader(new Blob([buffer]))
  const docs = await loader.load()
  return docs
}
this is my code, the file comes in as a parameter
Ray
Where does the file go into that code?
What is the type of the file?
Dutch SmoushondOP
It's an object
Ray
An object of what? Could you log it out?
PPTXLoader accepts a blob or a string:
either the blob of the file or the path of the file
Dutch SmoushondOP
A File object is a specific kind of Blob, and can be used in any context that a Blob can.
According to mdn
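Since a `File` is a kind of `Blob`, the same conversion works for either one when a library wants a Node.js `Buffer` instead. A minimal sketch, assuming Node.js 18 or later where `Blob` is available as a global; `blobToBuffer` is a hypothetical helper name.

```typescript
import { Buffer } from "node:buffer";

// Convert a Blob (or a File, which is a kind of Blob) into a Node.js Buffer.
export async function blobToBuffer(blob: Blob): Promise<Buffer> {
  // arrayBuffer() reads the whole blob into memory, so this suits
  // small and medium files rather than very large uploads.
  return Buffer.from(await blob.arrayBuffer());
}
```

A `File` taken from `request.formData()` could be passed to this helper unchanged.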
Ray
then try `new PPTXLoader(file)`
Dutch SmoushondOP
Taking some time because my code is all messed up from trying a bunch of different ways to solve this
This is what the file looks like
file: File {
  size: 647237,
  type: 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
  name: 'Dickinson_Sample_Slides.pptx',
  lastModified: 1710868377332
}
This is what the function looks like currently
async function readPPT(file) {
  const loader = new PPTXLoader(file)
  const docs = await loader.load()
  console.log({ docs })
  return docs
}
Ray
this works for me
Dutch SmoushondOP
In production?
It works locally but not in production for me
Ray
where are you running it in production?
Dutch SmoushondOP
vercel
Ray
let me try it real quick
I can't use PPTXLoader there because I don't know how to configure the tempFilesLocation for officeParser
I use officeParser directly and it works:
export async function POST(request: Request) {
  const formData = await request.formData();
  const file = formData.get("file") as File;
  const docs = await officeParser.parseOfficeAsync(
    Buffer.from(await file.arrayBuffer()),
    {
      tempFilesLocation: "/tmp",
    }
  );
  // const loader = new PPTXLoader(file);
  // const docs = await loader.load();
  return Response.json(docs);
}
Dutch SmoushondOP
how are you importing officeparser?
Ray
import path from "path";
import { PPTXLoader } from "langchain/document_loaders/fs/pptx";
import officeParser from "officeparser";
export async function POST(request: Request) {
  const formData = await request.formData();
  const file = formData.get("file") as File;
  const docs = await officeParser.parseOfficeAsync(
    Buffer.from(await file.arrayBuffer()),
    {
      tempFilesLocation: "/tmp",
    }
  );
  // const loader = new PPTXLoader(file);
  // const docs = await loader.load();
  return Response.json(docs);
}
we could use `/tmp` on vercel for temporary storage
https://github.com/vercel/vercel/discussions/5320#discussioncomment-110775
Answer
if you need to use `PPTXLoader`, I think you should ask them how to set the `tempFilesLocation`
Dutch SmoushondOP
Is there a limit on the file size that is being sent to the backend?
Ray
are you talking about the body size?
Dutch SmoushondOP
Because one file worked but a larger one didn't
Ray
or /tmp?
@Ray are you talking about the body size?
Dutch SmoushondOP
Yes
@Ray or /tmp
Dutch SmoushondOP
/tmp is supposedly 512 MB according to the GitHub discussion linked above, which is plenty
@Ray https://nextjs.org/docs/app/api-reference/next-config-js/serverActions#bodysizelimit
Dutch SmoushondOP
Thank you, I'm gonna increase it a bit
@Ray body size is 1mb by default
Dutch SmoushondOP
Thank you for all your help! Been stuck on this for days and I hate being stuck on a bug, can't stop thinking about it
Dutch SmoushondOP
The smaller file worked, so I assume it has to do something with file size
cool
Dutch SmoushondOP
Getting this error
The maximum payload size for the request body or the response body of a Serverless Function is 4.5 MB. If a Serverless Function receives a payload in excess of the limit it will return an error 413: FUNCTION_PAYLOAD_TOO_LARGE. See How do I bypass the 4.5MB body size limit of Vercel Serverless Functions for more information.
Ray
and what size did you set?
Dutch SmoushondOP
7 mb
i set 10mb but I'm thinking maybe I need to upgrade vercel
Dutch SmoushondOP
It's alright, I can just send back a React toast or maybe split the doc into two requests
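A minimal sketch of the "send back a toast" idea: check the file size before uploading and produce a message the UI can display. The 4.5 MB figure comes from the error quoted above; treating it as 4.5 × 1,000,000 bytes is a conservative assumption, and `checkUploadSize` is a hypothetical helper. Form encoding adds some overhead to the request body, so a file just under the limit may still be rejected.

```typescript
// Assumed byte value of the 4.5 MB payload limit quoted in the error message.
export const BODY_LIMIT_BYTES = 4.5 * 1000 * 1000;

export type SizeCheck = { ok: true } | { ok: false; message: string };

// Returns ok for files within the limit, otherwise a user-facing message.
export function checkUploadSize(
  sizeBytes: number,
  limitBytes: number = BODY_LIMIT_BYTES,
): SizeCheck {
  if (sizeBytes <= limitBytes) return { ok: true };
  const toMb = (n: number): string => (n / 1_000_000).toFixed(1);
  return {
    ok: false,
    message:
      `File is ${toMb(sizeBytes)} MB; the limit is ${toMb(limitBytes)} MB. ` +
      "Please split it into smaller files.",
  };
}
```

On the frontend this could run against `file.size` before the request is sent, so the user sees the message without waiting for a 413 response.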
Ray
for large files you may need to upload the file to S3 and read it in the API route
Dutch SmoushondOP
Yah, I'll decide what to do later, but 4.5 MB for a file is not small. Perhaps I'll show a message asking to split up the file
Maybe there is a way to filter out all the extra stuff in a pptx file, I just need the words
Anyhow, thank you for your time
you're welcome