Converting speech to text is a difficult technological problem, especially if you can't train the speech recognition software. Here's a video that illustrates how YouTube's audio transcription works for novels (also check the original video):
The results are terrible, though keep in mind that auto-captioning works best for speeches. There are many hilarious mistakes: "George Orwell" is recognized as "but it wasn't", "Lolita" becomes "don't think so", "the hobbit" turns into "the hall", and "cold day" is converted to "cocaine".
And if that's not enough, try enabling auto-captioning for the video embedded above. "This goes on a infinite loop... the transcribe audio function applied to this version transforms entire non-sense phrases into single words," comments RequiemPipes.
{ Thanks, Richard. }