Next.js Discord

Discord Forum

API for STT (audio to text transcription) or some free software?

Unanswered
Rex posted this in #help-forum
RexOP
I'm Italian and I'm looking for an STT tool or API that can transcribe audio files into text with timestamps. Do you know of any?

89 Replies

this is pretty accurate and can transcribe with timestamps
you need to take care of file size and audio duration though
RexOP
thanks! they're free, right?
lol no but free if you got the specs to run it locally
RexOP
ahaha understandable
do you mean like a server that runs locally?
yes
https://huggingface.co/openai/whisper-large-v3
RexOP
this one is free?
yes
you can run it locally
there's an inference API on huggingface which is free but pretty slow
you can see it next to the README,
on the right side
RexOP
I'm kinda new to this...
Would you recommend some tutorial or an already made program?
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

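def speech_to_text(filename, language=None):
    """Transcribe an audio file and return the plain text (or an error message)."""
    try:
        # Fall back to English when no language is given.
        result = pipe(filename, generate_kwargs={"language": language or "english"})
        # result["chunks"] holds the per-segment timestamps; only the text is returned here.
        return result["text"]
    except Exception as e:
        # Errors are returned as a string, so they show up after "Transcription:".
        return f"Error: {e}"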

if __name__ == "__main__":
    import argparse
    
    parser = argparse.ArgumentParser(description="Transcribe audio file using Whisper model")
    parser.add_argument("filename", help="Path to the audio file")
    parser.add_argument("--language", help="Language of the audio (default: english)", default="english")
    
    args = parser.parse_args()
    
    transcription = speech_to_text(args.filename, args.language)
    print("Transcription:", transcription)
this uses an NVIDIA GPU (if available), else the CPU
this script will autodownload the model and run it
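Since the question was about timestamps: the script above returns only `result['text']`. With `return_timestamps=True`, the pipeline result should also carry a `chunks` list, where each entry has a `timestamp` pair (start, end, in seconds) and its `text`. A minimal sketch of turning that into readable lines follows. `format_timestamp`, `format_chunks` and the sample data are made up for illustration and are not part of the script above.

```python
def format_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm; None (an open-ended chunk) becomes '??'."""
    if seconds is None:
        return "??"
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def format_chunks(result):
    """Turn a pipeline result dict into one '[start -> end] text' line per chunk."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        text = chunk["text"].strip()
        lines.append(f"[{format_timestamp(start)} -> {format_timestamp(end)}] {text}")
    return "\n".join(lines)


# Hand-written sample in the shape the pipeline is assumed to return.
sample = {
    "text": " Ciao a tutti. Oggi parliamo di Python.",
    "chunks": [
        {"timestamp": (0.0, 1.5), "text": " Ciao a tutti."},
        {"timestamp": (1.5, 4.25), "text": " Oggi parliamo di Python."},
    ],
}
print(format_chunks(sample))
```

In the script, `speech_to_text` could return `format_chunks(result)` instead of `result['text']` to get one timestamped line per segment.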
RexOP
woooah
so I just need to have Python installed or also pytorch?
and then I create a file and run it, right?
yes
you need the cuda version of pytorch
also that transformers package
RexOP
I've created a file in VSC (.py) and pasted the text you provided
RexOP
where can I download them?
cuda is only available on windows and linux, not on mac
cuda is nvidia's tech
if you're on mac, you might have to tweak the script to use GPU or whatever your mac has
this probably won't run on NPU
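A rough sketch of that tweak: PyTorch exposes Apple's GPU backend as `mps`, so the device line in the script could check for it after CUDA. `pick_device` is a name made up here, and whether Whisper large-v3 runs well on `mps` is not verified, so treat this as a starting point only.

```python
def pick_device():
    """Return 'cuda:0', 'mps', or 'cpu' depending on what this machine offers."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed at all; the script will not run until it is.
        return "cpu"
    if torch.cuda.is_available():
        # NVIDIA GPU with a CUDA build of PyTorch.
        return "cuda:0"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        # Apple GPU (Metal Performance Shaders backend).
        return "mps"
    return "cpu"


device = pick_device()
print("Using device:", device)
```

The result would replace the `device = ...` line in the script; the `torch_dtype` line there only switches to `float16` for CUDA, so other devices keep `float32`.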
RexOP
I'm on Windows
great
RexOP
so the version of CUDA i need to install is related to the GPU i have in my pc?
yes
but I think if you have latest drivers, install latest cuda version of pytorch
it should work
that's what I did
RexOP
oh thanks
I was struggling to find the version. Right now I'm on a Huawei laptop, so I think I have an integrated GPU
maybe it's better to run it on desktop computer
if you don't have cuda, this script will use cpu
yeah
RexOP
but in that case do I need to install the CPU version, or if I install the GPU version will it switch automatically?
yeah cpu version
RexOP
I have this one on matebook 14s, so it's cpu right?
and is the package manager pip?
it's a GPU, but it's an integrated one, so just install the CPU version of pytorch
if you used python.org to install python, yes
RexOP
thanks, I'll try now :))
and the transformers package?
pip install transformers
RexOP
thanks, I've installed everything. I had to enable long paths because I was getting errors; now there are only warnings...
Now I just need to run the file with Python. Is there a specific command for pytorch?
and for the language I just changed the word 'english' to 'italian' in my code
if you want to tweak the options of it, you should take a look at model card on huggingface url above
just run the script with: python main.py <path to audio file> --language italian
RexOP
I got all of this:
config.json: 100%|█████████████████████████████████████████████████████████████████| 1.27k/1.27k [00:00<?, ?B/s]
C:\Users\simon\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\huggingface_hub\file_download.py:147: UserWarning: `huggingface_hub` cache-system uses 
symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\simon\.cache\huggingface\hub\models--openai--whisper-large-v3. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. 
In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)
Traceback (most recent call last):
  File "D:\Programming\Personal\SbobAI\main.py", line 8, in <module>
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\simon\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\transformers\models\auto\auto_factory.py", line 564, in from_pretrained        
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\simon\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\transformers\modeling_utils.py", line 3372, in from_pretrained
    raise ImportError(
ImportError: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install 'accelerate>=0.26.0'`
1 error I think because I wasn't in Administrator mode
it tells you to install a package: pip install accelerate
also turn on developer mode
RexOP
I feel embarrassed but I really don't know what that is
it's pretty slow, but I think that's down to my connection
model.safetensors:  15%|███████▊                                            | 461M/3.09G [01:03<05:35, 7.83MB/s]
what are you using it for if I can ask? :))
I tried to make a language learning app
it was less about learning and more about practicing
RexOP
with which stack?
I used this model to take your voice input > convert to text and feed that text to gemini/gpt to get response/feedback
RexOP
that's great! And how's it going?
it's going pretty good
RexOP
I love that!
I'm Italian and was in Sweden trying to learn Swedish (which is similar to German and really difficult for speakers of a Latin language like ours) using Duolingo.
I've found the same struggles you mention in your Motivation section
and how did you get to this point, if I can ask? (self-taught or school) :))
I stopped giving so much time to Duolingo and started spending more time learning on my own from online resources like books, A1 videos etc., and practicing in my app
RexOP
Love this, I'm also passionate about this area of learning, and a year ago I tried to make an app for learning words and building your vocabulary
let's not pollute this chat and continue in #off-topic :)
RexOP
yes, I just have one error in the code:
model.safetensors: 100%|███████████████████████████████████████████████████| 3.09G/3.09G [06:42<00:00, 7.67MB/s]
generation_config.json: 100%|██████████████████████████████████████████████| 3.90k/3.90k [00:00<00:00, 7.82MB/s]
preprocessor_config.json: 100%|████████████████████████████████████████████████████████| 340/340 [00:00<?, ?B/s]
tokenizer_config.json: 100%|█████████████████████████████████████████████████| 283k/283k [00:00<00:00, 1.09MB/s]
vocab.json: 100%|██████████████████████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 1.97MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████| 2.48M/2.48M [00:00<00:00, 4.93MB/s]
merges.txt: 100%|████████████████████████████████████████████████████████████| 494k/494k [00:00<00:00, 1.55MB/s]
normalizer.json: 100%|█████████████████████████████████████████████████████| 52.7k/52.7k [00:00<00:00, 45.3MB/s]
added_tokens.json: 100%|███████████████████████████████████████████████████| 34.6k/34.6k [00:00<00:00, 1.04MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████| 2.07k/2.07k [00:00<?, ?B/s]
Transcription: Error: ffmpeg was not found but is required to load audio files from filename
install ffmpeg, add it to your PATH and restart your laptop
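Both errors so far (the missing Accelerate package and the missing ffmpeg) can be caught before the 3 GB model download with a small preflight check. This is only a sketch using the standard library; `check_setup` is a made-up helper name, and the package list is simply what this thread has needed so far.

```python
import importlib.util
import shutil


def check_setup(packages=("torch", "transformers", "accelerate"), tools=("ffmpeg",)):
    """Return a list of human-readable problems; an empty list means all were found."""
    problems = []
    for name in packages:
        # find_spec returns None when a top-level package cannot be imported.
        if importlib.util.find_spec(name) is None:
            problems.append(f"Python package missing: {name}")
    for tool in tools:
        # shutil.which looks the program up on PATH, like the shell would.
        if shutil.which(tool) is None:
            problems.append(f"Program not found on PATH: {tool}")
    return problems


if __name__ == "__main__":
    issues = check_setup()
    if issues:
        print("\n".join(issues))
    else:
        print("Everything the script needs was found.")
```

If ffmpeg is still reported as missing after installing it, the terminal was probably opened before PATH was updated; opening a new terminal (or restarting) should pick up the change.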
I did it, and now it's running, I'll tell you if it worked after it finished 🙌
just got these warnings:
Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\transformers\models\whisper\generation_whisper.py:496: FutureWarning: The input name `inputs` is deprecated. Please make sure to use `input_features` instead.
  warnings.warn(
You have passed language=italian, but also have set `forced_decoder_ids` to [[1, None], [2, 50360]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of language=italian.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
unfortunately it stopped without the output of "Transcription:"