When you exhaust all the language data from text, you can start extracting language from audio and video.

As far as I know, the largest public repository of audio and video is YouTube. We can do a rough back-of-the-envelope computation of how much data is in there:

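  • According to a 2019 article I found, 500 hours of video are uploaded to YouTube every minute. If we assume this was the average rate over the last 15 years, that gets us roughly 200 billion minutes of video (500 hours is 30,000 minutes of video per minute of wall-clock time, and 15 years is about 7.9 million minutes).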
  • Conversational speech averages about 150 words per minute, according to a Google search. Applied to 200 billion minutes, that gets us 30T words, or 30T tokens if we assume 1 token per word (is this right?)
  • Let's say 1% of that is actually useful, so that gets us 300B tokens, which is... a lot less than I expected.
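
For anyone who wants to vary the inputs, here is a minimal sketch of the same estimate in Python. The function name and parameter names are made up for illustration. The defaults assume an upload rate of 500 hours per minute (the rate that the 200 billion minutes figure implies), 15 years, 150 words per minute, 1 token per word, and a 1% useful fraction. All of these are guesses, not measurements. Run without rounding, the numbers come out somewhat higher than the rounded ones above (about 237 billion minutes and about 355B useful tokens), but the order of magnitude is the same.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years; fine for a rough estimate


def estimate_youtube_tokens(
    hours_uploaded_per_minute=500,  # assumed upload rate (2019 figure)
    years=15,                       # assumed to hold on average over this span
    words_per_minute=150,           # assumed conversational speaking rate
    tokens_per_word=1.0,            # assumption; real tokenizers may differ
    useful_fraction=0.01,           # assumed share of speech worth training on
):
    """Return (video_minutes, total_tokens, useful_tokens) for the estimate."""
    # Minutes of video uploaded per minute of wall-clock time.
    video_minutes_per_minute = hours_uploaded_per_minute * 60
    # Total minutes of video uploaded over the whole period.
    video_minutes = video_minutes_per_minute * years * MINUTES_PER_YEAR
    # Treat every minute of video as continuous speech (an overestimate).
    total_words = video_minutes * words_per_minute
    total_tokens = total_words * tokens_per_word
    useful_tokens = total_tokens * useful_fraction
    return video_minutes, total_tokens, useful_tokens


if __name__ == "__main__":
    minutes, tokens, useful = estimate_youtube_tokens()
    print(f"video minutes: {minutes:.3g}")
    print(f"total tokens:  {tokens:.3g}")
    print(f"useful tokens: {useful:.3g}")
```

Changing `tokens_per_word` or `useful_fraction` by a factor of a few moves the result proportionally, so the conclusion is only as good as those two guesses.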

So it seems like video doesn't save us, if we just use it for the language data. We could do self-supervised learning on the video data, but for that we need to know the scaling laws for video (has anyone done that?).