Pocket follows a long line of bookmarking services that allow you to tag and save web pages. As a bonus, it converts the web page into a clean, ad-free and very readable version.
Currently Pocket allows you to save text-based content. This proposal is concerned with ways of extending it to spoken-word audio and video.
A little bit of background – I’m a technologist who specialises in the application of machine transcription. I’m also a compulsive hoarder of web pages. For nearly ten years now I’ve been collaborating with others on a technology dubbed Hyperaudio, and along the way I co-founded a company called Trint, which helps you edit a machine-generated transcript while maintaining word timings.
Most of my work concerns timed or interactive transcripts - I’m a developer at heart but as I become more involved in the product side of things I spend more time writing than coding.
I’ve been involved with this thing we used to call the World Wide Web for a while now and remember a number of similar applications to Pocket - Delicious anyone? I was even loosely involved with something called Licorize developed by Open Lab here in my home town of Florence.
So what’s the itch I’m trying to scratch? What is the value I’m proposing the Pocket team add to their product?
Just as Pocket creates a more accessible version of web pages, I’m proposing they should make more accessible versions of web-based audio and video, with added permanency for good measure.
In case you haven't guessed, the accessibility I'm talking about is a transcript – a timed transcript that is linked to the audio or video – an interactive transcript.
Broadly speaking we can:
Download a piece of video or audio onto a server, submit it to a speech-to-text algorithm and extract a transcript and associated timings.
Allow the user to fix any errors (or not - transcripts of good quality audio are usually very good and we’ve still added value).
Make the transcript interactive (this is the promise of an open source library we’ve created called Hyperaudio-Light).
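To make the idea of a timed transcript concrete, here is a minimal sketch in Python. It assumes each word carries a start and end time in seconds; the `Word` class and the `word_at` and `seek_time` helpers are hypothetical names for illustration and are not the format or API that Hyperaudio-Light or any speech-to-text service actually uses.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the media (assumed unit)
    end: float


def word_at(words, t):
    """Return the index of the word being spoken at time t, or None.

    Assumes words are sorted by start time and do not overlap.
    """
    starts = [w.start for w in words]
    i = bisect_right(starts, t) - 1  # last word starting at or before t
    if i >= 0 and t < words[i].end:
        return i
    return None


def seek_time(words, index):
    """Clicking a word in the transcript seeks the media to its start."""
    return words[index].start


# Hypothetical output of the transcription step.
words = [
    Word("Pocket", 0.0, 0.4),
    Word("saves", 0.5, 0.9),
    Word("pages", 1.0, 1.5),
]
```

The two helpers cover both directions of interactivity: `word_at` lets a player highlight the current word as the media plays, and `seek_time` lets a click on a word move the playhead.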
Keeping to the spirit of “just because you can do something, doesn’t mean you should” – let’s tackle the why.
In a word – accessibility. Spoken-word audio and video become much more accessible once you transcribe them. If you link the text to the corresponding timepoints in the media, it becomes an order of magnitude more accessible. This applies not just to people with disabilities or restricted bandwidth, which of course is fundamental, but to all of us. There’s a lot of podcasting and vlogging going on, and associating a transcript with these pieces of media means we can all more easily:
Navigate (by scrolling through the transcript and clicking on those parts that interest us)
Search (it’s all text now)
Share (the same way we share text but with added timecodes pointing to the original)
Bookmark (especially excerpts)
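Searching and sharing can be sketched in a few lines once the words are timed. The example below assumes a transcript stored as `(text, start, end)` tuples and builds the share link with the `#t=start,end` temporal syntax from the W3C Media Fragments URI recommendation. The function names and the example URL are made up for illustration, and whether a given player honours the fragment is a separate question.

```python
def find_phrase(words, phrase):
    """Return (start, end) in seconds for the first match of phrase, or None.

    words is a list of (text, start, end) tuples; matching is
    case-insensitive and on whole words only.
    """
    target = phrase.lower().split()
    if not target:
        return None
    texts = [w[0].lower() for w in words]
    for i in range(len(texts) - len(target) + 1):
        if texts[i:i + len(target)] == target:
            return words[i][1], words[i + len(target) - 1][2]
    return None


def share_link(media_url, start, end):
    """Append a temporal media fragment pointing at the excerpt."""
    return f"{media_url}#t={start:.2f},{end:.2f}"


# Hypothetical transcript excerpt.
words = [
    ("Mozilla", 12.0, 12.5),
    ("acquired", 12.6, 13.1),
    ("Pocket", 13.2, 13.7),
]

span = find_phrase(words, "acquired pocket")
link = share_link("https://example.com/talk.mp3", *span)
```

A bookmark of an excerpt then only needs to store the media URL and the two timestamps, which is what the link encodes.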
OK so ...
Mozilla acquired Pocket a while back in an attempt to diversify revenue streams.
Mozilla is also working on an amazing initiative called Common Voice, which aims to create an open data set that anyone can use to train speech-to-text engines. Mozilla is using it for its own DeepSpeech engine.
Open source software, often inspired and supported by Mozilla’s initiatives, is available to create and even correct interactive transcripts, not to mention libraries that allow rich media embedding on social media.
It scales! Although I’m sure Mozilla doesn't lack the resources required to run hungry speech-to-text algorithms, the more people bookmarking audio and video, the more likely it is that it has already been transcribed and perhaps even corrected.
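The scaling argument amounts to a shared cache keyed by the media URL. A minimal sketch, assuming an in-memory store and a `transcribe` callable standing in for whatever speech-to-text step is used (both are hypothetical, as is the `TranscriptCache` name):

```python
import hashlib


class TranscriptCache:
    """Transcribe each media URL once and reuse the result for later bookmarks."""

    def __init__(self, transcribe):
        self._transcribe = transcribe  # hypothetical callable: url -> transcript
        self._store = {}

    @staticmethod
    def key(url):
        # Hash the trimmed URL so keys have a fixed length.
        return hashlib.sha256(url.strip().encode("utf-8")).hexdigest()

    def get(self, url):
        """Return the stored transcript, running transcription only on a miss."""
        k = self.key(url)
        if k not in self._store:
            self._store[k] = self._transcribe(url)
        return self._store[k]

    def correct(self, url, transcript):
        """Replace the stored transcript with a user-corrected version."""
        self._store[self.key(url)] = transcript
```

With this shape, the second person to bookmark the same URL gets the existing transcript, including any corrections, without triggering another transcription run. A real service would also need to decide when two different URLs point at the same media, which this sketch does not attempt.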
So that’s it?
Yup, that’s all I really wanted to write. Get it out there for discussion and be of use if I can be. I support Mozilla in their mission and of course I’m always looking for ways to apply Hyperaudio technology.
I’m maboa on Twitter - my DMs are open and I have a weekly newsletter over at https://tinyletter.com/maboa
"DJI OSMO Pocket" by TheBetterDay is licensed under CC BY-ND 2.0