Transcribe video lessons

Starting from mp4 videos, extract the audio and transcribe it to text files


In this tutorial, we will see how to get audio transcriptions (text files) from a batch of mp4 videos.
The aim is to help students get transcripts of their teachers' online courses, using one of the best black-box ML services available: the Google Speech-to-Text API.
Note: a better speech-to-text solution may exist for English, but the example here uses lessons in Italian.

Extract audio from the mp4

We assume that in the video only the teacher speaks, so we will extract a mono channel.

# Get a 16 kHz mono wav file for each mp4 file found in the current directory:
❯ for FILE in *.mp4; do ffmpeg -i "$FILE" -acodec pcm_s16le -ac 1 -ar 16000 "${FILE%.*}.wav"; done
  • Tip: before converting, normalize the file names (spaces and special characters) with
    ❯ detox .
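
The output name in the loop above comes from the shell parameter expansion ${FILE%.*}, which strips only the last extension. A minimal sketch of that step in isolation; the helper name wav_name_for is hypothetical and not part of the original commands:

```shell
#!/bin/sh
# Hypothetical helper: derive the .wav name that corresponds to a video file.
wav_name_for() {
  # ${1%.*} removes the shortest trailing ".something", i.e. the last extension
  printf '%s\n' "${1%.*}.wav"
}

wav_name_for "lesson_01.mp4"      # prints lesson_01.wav
wav_name_for "week.2.intro.mp4"   # prints week.2.intro.wav
```

Only the final extension is replaced, so dots inside the file name are preserved.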

Initialize Google Cloud Platform (GCP)

  • Claim the $300 free credit from the free tier link
  • Create a bucket (here named “example-sbobinate”)
  • Enable the Speech-to-Text API
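
The last two steps can also be done from the command line instead of the web console. A sketch, assuming the Cloud SDK is installed and a project is already selected; the europe-west1 location and the DRY_RUN switch are assumptions added for illustration. By default the commands are only printed, so they can be reviewed before running them for real with DRY_RUN=0.

```shell
#!/bin/sh
BUCKET="example-sbobinate"   # bucket name used in this tutorial
LOCATION="europe-west1"      # assumption: pick the location closest to you
DRY_RUN="${DRY_RUN:-1}"      # assumption: 1 = only print the commands

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

setup_gcp() {
  run gsutil mb -l "$LOCATION" "gs://$BUCKET"       # create the bucket
  run gcloud services enable speech.googleapis.com   # enable the Speech-to-Text API
}

setup_gcp
```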

Upload the wav files to GCS

❯ gsutil -m cp *.wav gs://example-sbobinate/test/
  • Tip: Slow upload? Make sure the bucket location is close to your region

Use the speech to text API

  • Log in to your GCP account
❯ gcloud init
  • Call the API and store the transcriptions
# File: ``
# Requires gsutil, gcloud, jq

mkdir -p transcriptions

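for FILE_PATH in $(gsutil ls "gs://example-sbobinate/test/"); do
  echo "Submit file $FILE_PATH"
  RUN_ID=$(gcloud ml speech recognize-long-running "$FILE_PATH" --language-code=it-IT --async | jq -r .name)

  # Name the output after the audio file, e.g. lesson1.wav -> transcriptions/lesson1.json
  # (the .json naming is an assumption; the original script left OUTPUT undefined)
  FILE_NAME=$(basename "$FILE_PATH")
  OUTPUT="transcriptions/${FILE_NAME%.*}.json"

  echo "Run id: $RUN_ID"
  echo "OUTPUT: $OUTPUT"

  # Block until the operation completes, then save the JSON response
  gcloud ml speech operations wait "$RUN_ID" >"$OUTPUT"
  echo "-------------"
done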

Parse and store the transcriptions

  • Parse all the JSON received from the Google Speech-to-Text API
# File: ``

mkdir -p ./transcriptions/only_text/

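for FILE in ./transcriptions/*; do
  [ -f "$FILE" ] || continue  # skip the only_text directory
  echo "Start working on $FILE..."

  # One text file per JSON file, named after it (the .txt naming is an assumption)
  NAME=$(basename "$FILE")
  OUTPUT="./transcriptions/only_text/${NAME%.*}.txt"
  echo "OUTPUT: $OUTPUT"
  : >"$OUTPUT"  # create (or empty) the file

  RESULTS=$(jq .results "$FILE") # get the transcriptions

  for row in $(echo "${RESULTS}" | jq -r '.[] | @base64'); do
    # Isolate only the text of the 1st alternative and append it to the output
    TRANSCRIPTION=$(echo "${row}" | base64 --decode | jq -r '.alternatives[0].transcript')
    echo "$TRANSCRIPTION" >>"$OUTPUT"
  done
done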
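
The same extraction can also be done in a single jq call, without the base64 round trip. A sketch, assuming the saved JSON has the results / alternatives / transcript shape that the loop above relies on; the function name extract_transcripts and the sample file are made up for illustration and are not real API output:

```shell
#!/bin/sh
# Hypothetical helper: print the first alternative of every result, one per line.
extract_transcripts() {
  jq -r '.results[].alternatives[0].transcript' "$1"
}

# Made-up sample with the assumed response shape (not real API output).
SAMPLE="$(mktemp)"
cat >"$SAMPLE" <<'EOF'
{
  "results": [
    {"alternatives": [{"transcript": "buongiorno a tutti", "confidence": 0.93}]},
    {"alternatives": [{"transcript": "oggi parliamo di derivate", "confidence": 0.9},
                      {"transcript": "oggi parliamo di derivati", "confidence": 0.6}]}
  ]
}
EOF
```

Calling extract_transcripts "$SAMPLE" prints one transcript line per result; on real files, redirect the output to the matching text file under ./transcriptions/only_text/.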

Check the results

  • Check the video transcriptions under ./transcriptions/only_text/
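
A quick way to spot lessons that came back empty is to count the words in each text file. A sketch using only standard tools; the function name report_transcripts is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print "<word count> <file>" for every .txt file in a directory.
report_transcripts() {
  for f in "$1"/*.txt; do
    [ -f "$f" ] || continue        # skip when the glob matches nothing
    words=$(( $(wc -w <"$f") ))    # arithmetic expansion trims wc's padding
    printf '%s %s\n' "$words" "$f"
  done
}

# Run it on the output directory if it exists.
if [ -d ./transcriptions/only_text ]; then
  report_transcripts ./transcriptions/only_text
fi
```

A count of 0 usually means the audio was silent or the recognition returned no results for that file.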


  • Google recognize-long-running API