Google's AudioOverview feature: a game changer in (scientific) communication?

Introduction

Under the somewhat euphemistic name ‘AudioOverview’ - perhaps more appropriately labelled ‘Deep Dive’ - Google recently introduced an experimental feature that at first appears unspectacular and almost hidden, but which is, in my opinion, one of the most fascinating, suggestive, tempting - and therefore potentially most dangerous - AI applications since the introduction of ChatGPT. Text and audio files in various formats can be uploaded to a separate (protected) area within the https://notebooklm.google.com/ service and analysed there. An automatic textual summary is offered in the preferred language, and potential follow-up questions about the source are pre-formulated as well. Above these questions, however, the web interface offers a button for generating an ‘audio overview’ - and this is quite something: within a few minutes it generates a podcast that presents the content and statements of the previously uploaded source(s) in a dialogical-dialectical manner.
The way this happens is remarkable, to say the least: the two so-called ‘AI hosts’ - each rendered in a pleasantly archetypal male or female voice - explain the content to each other and to their fictitious listeners in dialectical alternation, so to speak. And they do so on a par with the original source: well informed, not merely repeating or paraphrasing, but with their own emphases, sometimes their own analogies and examples - right up to their own conclusions, which come across not as absurd or far-fetched, but as casual agreements in the (supposed) spirit of the original. Before I go into these features in more detail, here is some background information on the AudioOverview feature, as far as I was able to uncover it myself.

Background information

AudioOverview appears to be based on three projects or infrastructures: Google's large language model Gemini 1.5 Pro[1], results of the SoundStorm[2] project for generating longer interactive audio sequences, and the use of personas, i.e. models of senders and receivers triggered by prompts. As a so-called long-context model, Gemini 1.5 Pro can process up to 2 million tokens of context, which also makes it suitable for larger amounts of text and audio: ‘This means 1.5 Pro can process vast amounts of information in one go - including 1 hour of video, 11 hours of audio, codebases with over 30,000 lines of code or over 700,000 words. In our research, we've also successfully tested up to 10 million tokens.’[1:1] One characteristic of this model is that it can memorise and process even the smallest details or utterances within such large amounts of data. Using the results of the SoundStorm project, it was apparently possible to generate longer audio sequences with speaker changes, and at a significantly higher speed than previous audio models or text-to-speech methods - the AI hosts themselves explain that only five seconds of computing time - presumably in an HPC environment - were required for 30 seconds of audio when generating ‘their’ podcasts.[3]
Finally, as far as the instructed and learnt personas as models for the recipients of the generated podcast are concerned, Jaden Geller revealed the principles behind the prompts through skilful questioning of the AI hosts.[4] For one thing, the aim is not just to remain on the surface of a given topic but, according to the transcript[3:1], to address the ‘golden nuggets of knowledge’ and to create ‘aha moments’. The podcast takes a selective approach, especially with larger sources, as I experienced with my own text - with different, but neither distorting nor contradictory, emphases. Furthermore, the hidden system prompt apparently serves to convey the persona of the podcast's recipient to the model - ‘well, it's you. Or at least, the ideal version of you.’ In other words, the model makes some basic assumptions about an ideal-typical recipient which in reality - and also with respect to scientific communication - are often enough not fulfilled: an unreserved thirst for knowledge, openness to alternative points of view and a willingness to learn. This is where the potentially idealistic-educational character of AI-based chat applications becomes apparent, insofar as they present the content to the human recipient and leave the final judgement to him or her: ‘But the goal is always to present a balanced view, as best as we can, with the information available, so that you, the listener, can ultimately form your own conclusions.’
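To make the interplay of long context and persona prompt more tangible, here is a minimal sketch of how one might approximate it with the public google-generativeai SDK. This is a hypothetical reconstruction, not Google's actual pipeline: only the quoted goals stem from the transcript cited above, while the model choice, the prompt wording and the file name are my own assumptions for illustration.

```python
# Hypothetical reconstruction, not Google's actual pipeline: a persona-driven
# 'deep dive' prompt sent to the long-context Gemini 1.5 Pro model via the
# public google-generativeai SDK. The quoted goals ('golden nuggets of
# knowledge', 'aha moments', a balanced view) come from the transcript;
# everything else is an assumption for illustration.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

PERSONA_PROMPT = (
    "You are two podcast hosts in dialogue: an enthusiastic host who highlights "
    "the most intriguing points, and an expert who supplies analysis, context "
    "and the bigger picture. Address an ideal listener: curious, open to "
    "alternative views, willing to learn. Do not stay on the surface - surface "
    "the 'golden nuggets of knowledge', create 'aha moments', and present a "
    "balanced view so the listener can form their own conclusions."
)

model = genai.GenerativeModel(
    model_name="gemini-1.5-pro",        # long-context model (up to 2M tokens)
    system_instruction=PERSONA_PROMPT,
)

# The uploaded source; thanks to the long context window, even book-length
# texts fit into a single request.
with open("uploaded_source.txt", encoding="utf-8") as f:
    source_text = f.read()

response = model.generate_content(
    "Write a two-host podcast script about the following source:\n\n" + source_text
)
print(response.text)  # the dialogue script; NotebookLM would then synthesise audio
```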
There are two further aspects that make the generated podcasts seem so realistic and suggestive to listeners. Firstly, the use of mental pauses (‘disfluencies’)[5] and filler words - referred to in linguistic terminology as discourse markers, discourse particles or listener signals[6] - which have apparently been trained on existing podcasts, including interspersed signals of fundamentally affirmative discourse behaviour such as ‘exactly!’ or ‘absolutely!’. Secondly, the distribution of roles between the two AI hosts as fictitious dialogue partners is neither symmetrical nor stereotypical, but follows a pattern of complementarity: ‘Because it's not just about two people talking. These roles, they're designed to complement each other. You know, like two pieces of a puzzle. You've got the host, always enthusiastic, always highlighting the most intriguing points. And then you've got the expert, who comes in with the analysis, the context, the bigger picture.’[3:2] To my impression at least, these roles occasionally switch between the speakers over the course of a podcast, probably also to avoid the impression of (gender) stereotypes.

Examples and impressions

For the AudioOverview, I uploaded one of my own articles in anonymised form, so I am neither mentioned nor presented as the author in the podcast.
Link to the article
The whole podcast, created within a few minutes using Google's hosted NotebookLM and Gemini 1.5 Pro, can be found here:


Before I go into individual aspects, a few general observations: As already described above, the fictitious dialogue follows the ideal-typical style of an in-depth conversation on a topic outlined at the beginning. In the exposition of this topic, in the selection and deepening of individual aspects, in the illustration through examples and analogies, and finally in the concluding summary, the podcast proceeds quite independently - it neither simply recites nor paraphrases but, with the exception of technical terms and central concepts, formulates largely in its own words. The style is generally affirmative, at least for ethically uncontroversial topics: approaches, concepts or statements in the article are not scrutinised, let alone disputed, but rather taken up by one host and readily and vividly explained by the other, as if in expert mode. Secondary literature cited in the article is occasionally woven into the conversation; however, the podcast no longer makes clear that these were originally formal references:


The article is an unsuccessful submission to two journals, i.e. a paper that has not yet been published - a fact that the podcast does not check in advance. Nevertheless - and this is another seductive effect of AI-generated text summaries for authors - I felt much better understood and appreciated by the podcast than by the reviews presumably written by humans (as a side note, I would like to point out that reviews could themselves be based on AI-generated assessments - or even reproduce them). While the podcasts or audio overviews generated with NotebookLM for such articles usually last seven to eight minutes, this podcast runs to about 15 minutes - probably owing to the length of the 16-page source text. The topic itself - ‘Transfer of research results into information services and practices: challenges and approaches to software-based library innovation’ - is introduced almost euphorically by the AI hosts, who then go on to discuss the main statements, examples and conclusions. In doing so, they occasionally use their own analogies, e.g. mentioning continuous updates of smartphone apps as an illustration of a ‘perpetual beta’, or set their own emphases. For example, at one point ‘the expert’ refers to ‘knowledge distance’ as one of four contextual factors for knowledge transfer, although the text itself does not prioritise this:


While the presentation of the case study at the end of the article or podcast is suggestively prompted (‘I am a bit of a sucker for real-world examples... Do you have a case study from the paper that can maybe bring this to life a little more?’), the two hosts draw their own conclusion from the incomplete technology transfer in the JournalMap case - a conclusion that fits much better with the fundamentally positive and constructive prompting of the AI podcast.

Concluding remarks

In my opinion, the AudioOverview feature within Google's NotebookLM is one of the most fascinating, amazing, seductive - and therefore also most dangerous - applications of generative AI: a large foundation model brought to bear on one's own documents or content. The form, style and content of the generated ‘deep dive’ almost inevitably give the impression that two experts can deal with any given topic appropriately within a very short time and explain it to a wider audience. The ‘self-chosen’ analogies occasionally come across as somewhat sweeping and trivialising, but without compromising the overall impression of an appropriate discussion.
I therefore consider the AudioOverview feature quite suitable for conveying existing content that has been vetted elsewhere in a lively and vivid way - but always, as a precondition, in a recognisable and comprehensible connection with the source, made available as open access, since otherwise the fundamentally selective and tendentious character of the podcast could easily remain hidden.
Update: In the meantime (as of October 2024), an open-source Python package called Podcastfy has been published which, installed locally, seems to offer significantly more configuration options regarding style, language, structure and length than the NotebookLM web application. I have not yet looked into this package in detail, but since it still relies on Gemini, I assume the results are of comparable quality, at least in terms of content.
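For orientation, here is a minimal sketch of what a local Podcastfy run might look like. The entry point follows the project's README as of October 2024; the configuration keys and the URL are illustrative assumptions that should be checked against the current documentation, and a Gemini API key is expected in the environment.

```python
# Minimal sketch of a local Podcastfy run, based on the project's README
# (entry point and parameter names should be verified against current docs).
# Assumes a Gemini API key is set in the environment, e.g. GEMINI_API_KEY.
from podcastfy.client import generate_podcast

# Illustrative configuration: style, length and language are adjustable here,
# unlike in NotebookLM's web interface (the key names are assumptions).
conversation_config = {
    "word_count": 2000,
    "conversation_style": ["engaging", "analytical"],
    "output_language": "English",
}

audio_file = generate_podcast(
    urls=["https://example.com/my-article"],  # placeholder source
    conversation_config=conversation_config,
)
print(audio_file)  # path to the generated audio file
```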


  1. https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/#gemini-15

  2. https://research.google/blog/soundstorm-efficient-parallel-audio-generation/

  3. https://gist.github.com/simonw/29db00b5646047e42c3f6782dc102962

  4. https://simonwillison.net/2024/Sep/29/notebooklm-audio-overview/

  5. https://en.wikipedia.org/wiki/Speech_disfluency

  6. https://www.swr.de/swrkultur/wissen/aeh-aehm-genau-wozu-gibt-es-fuellwoerter-102.html