
I am downloading WebVTT files from YouTube using youtube-dl.

A typical file looks like this:

Kind: captions
Language: en

00:00:00.730 --> 00:00:05.200 align:start position:0%


00:00:05.200 --> 00:00:05.210 align:start position:0%

00:00:05.210 --> 00:00:11.860 align:start position:0%
hi<00:00:06.440><c> I'm</c><00:00:07.440><c> here</c><00:00:07.740><c> to</c><00:00:08.160><c> talk</c><00:00:08.429><c> to</c><00:00:09.019><c> share</c><00:00:10.019><c> an</c><00:00:10.469><c> idea</c><00:00:10.820><c> to</c>

00:00:11.860 --> 00:00:11.870 align:start position:0%
hi I'm here to talk to share an idea to

00:00:11.870 --> 00:00:15.890 align:start position:0%
hi I'm here to talk to share an idea to
communicate<00:00:12.920><c> but</c><00:00:13.920><c> what</c><00:00:14.790><c> is</c><00:00:14.940><c> communication</c>

00:00:15.890 --> 00:00:15.900 align:start position:0%
communicate but what is communication

I would like to get a text file with this:

hi I'm here to talk to share an idea to
communicate but what is communication

Using code I found online, I got this:

cat output.vtt | sed "s/^[0-9]*[0-9\:\.\ \>\-]*//g" | grep -v "^WEBVTT\|^Kind: cap\|^Language" | awk 'BEGIN{ RS="\n\n+"; RS="\n\n" }NR>=2{ print }' > dialogues.txt

But it is far from perfect: I get a lot of useless spaces, and every sentence is displayed twice. Would you mind helping me? Somebody asked a similar question before, but the submitted answer did not work for me.



3 Answers


You might be able to do something similar to this:

sed -e '1,4d' -E -e '/^$|]|>$|%$/d' output.vtt | awk '!seen[$0]++' > dialogues.txt
  • The first sed expression removes the first 4 lines.
  • The second sed expression deletes blank lines, lines that contain ], and lines that end in > or %.
  • awk removes duplicate lines, keeping the first occurrence of each.


hi I'm here to talk to share an idea to
communicate but what is communication

You might have to tweak it a bit, but it should get you closer to what you want.
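
To illustrate the awk step in isolation, here is a minimal sketch on a few hand-typed input lines (hypothetical data, not a real download). The array name seen is arbitrary; any name works.

```shell
# Hypothetical input: each caption line appears twice, as in the filtered .vtt text.
deduped=$(printf '%s\n' \
  "hi I'm here to talk to share an idea to" \
  "hi I'm here to talk to share an idea to" \
  "communicate but what is communication" \
  "communicate but what is communication" |
  awk '!seen[$0]++')
# seen[$0]++ yields 0 (false) the first time a line is met and a positive
# count afterwards, so !seen[$0]++ is true only for first occurrences.
printf '%s\n' "$deduped"
```

Note that this drops every later repeat of a line anywhere in the file, so a phrase that is genuinely spoken twice would also be kept only once.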


Could you please try the following, which does everything in a single awk command?

awk 'FNR<=4 || ($0 ~ /^$|-->|\[|\]|</){next} !a[$0]++'  Input_file

Explanation of the above code:

awk '                                     ##Start the awk program.
FNR<=4 || ($0 ~ /^$|-->|\[|\]|</){        ##If the line number is 4 or less, OR the line is empty or contains -->, [, ] or <, then:
  next                                    ##skip this line; no further statements run for it.
}
!a[$0]++                                  ##Print the line only if it has not been seen before: a[$0] is 0 on the first occurrence and is incremented afterwards, so each distinct line is printed once.
'  Input_file                             ##Input file name.
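
As a quick check, the command can be run on a cut-down sample. The sketch below assumes the file begins with a WEBVTT line followed by the Kind and Language headers (so the first four lines are header material) and that blank lines are truly empty. sample_vtt is a hypothetical helper that only prints the test data.

```shell
# Hypothetical helper: prints a shortened caption file for testing.
sample_vtt() {
cat <<'EOF'
WEBVTT
Kind: captions
Language: en

00:00:05.210 --> 00:00:11.860 align:start position:0%
hi<00:00:06.440><c> I'm</c><00:00:07.440><c> here</c>

00:00:11.860 --> 00:00:11.870 align:start position:0%
hi I'm here to talk to share an idea to

00:00:11.870 --> 00:00:15.890 align:start position:0%
hi I'm here to talk to share an idea to
communicate<00:00:12.920><c> but</c><00:00:13.920><c> what</c>

00:00:15.890 --> 00:00:15.900 align:start position:0%
communicate but what is communication
EOF
}

# Skip the 4 header lines, empty lines, cue timings (-->), bracketed lines
# and lines with inline <...> tags; print each remaining line only once.
result=$(sample_vtt | awk 'FNR<=4 || ($0 ~ /^$|-->|\[|\]|</){next} !a[$0]++')
printf '%s\n' "$result"
```

If the "blank" lines in a real file actually contain a space or tab, /^$/ will not match them; replacing it with /^[ \t]*$/ is one possible adjustment.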

If you analyze the pattern of your .vtt file, the lines you want to keep are every 8th line, starting at line 10. So the approach is to delete the first 2 lines and then keep every 8th line of what remains:

$ cat output.vtt | sed '1,2 d' | awk 'NR%8==0'

hi I'm here to talk to share an idea to
communicate but what is communication
  • sed '1,2 d' deletes lines 1 and 2
  • awk 'NR%8==0' prints every 8th line
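
One way to check which original lines survive this pipeline is to feed it numbered input. In the sketch below, seq stands in for the .vtt file, so each line's content is its own line number.

```shell
# seq 1 20 prints the numbers 1..20, one per line, standing in for file lines.
# sed drops lines 1-2; awk then keeps every 8th line of what is left.
kept=$(seq 1 20 | sed '1,2 d' | awk 'NR%8==0')
printf '%s\n' "$kept"   # prints 10 and 18
```

The surviving lines are 10 and 18, that is, every 8th line starting at line 10. The approach only holds while every cue occupies the same number of lines; a single extra or missing line shifts everything after it.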

If you want to further filter out the "[...]" lines, you can add a grep command to the end of the pipeline, such as grep -v '^\[.*\]$'
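
A minimal sketch of that extra filter, using hypothetical bracketed lines such as [Music] as stand-ins for the "[...]" lines:

```shell
# Hypothetical input: caption text mixed with bracketed sound descriptions.
filtered=$(printf '%s\n' \
  "[Music]" \
  "hi I'm here to talk to share an idea to" \
  "[Applause]" \
  "communicate but what is communication" |
  grep -v '^\[.*\]$')
# -v inverts the match, so only lines that are NOT entirely "[...]" remain.
printf '%s\n' "$filtered"
```

Because the pattern is anchored with ^ and $, a caption line that merely contains brackets somewhere in the middle is kept.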