A Dataset for Teaching and Evaluating RAG
As a fan of Acquired (https://www.acquired.fm/), I recently published a dataset containing 200 Acquired Podcast Transcripts with metadata, complete with a human-generated Q&A file (see the dataset at Kaggle).
This dataset was used in my Introduction to Generative AI course to teach and evaluate Retrieval-Augmented Generation (RAG). The 200 transcripts contain approximately 3.5 million words, which is equivalent to about 5,500 pages when formatted as a Word document.
I tasked each student with listening to an episode of their choice and then coming up with at least three question-answer pairs to test the accuracy of the answers from both GPT-4 alone and GPT-4 with the transcript. The results, shown in the figure below, demonstrate that RAG significantly improved answer accuracy.
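For readers who want to try a similar comparison, here is a minimal sketch in Python of the two conditions: the same question asked with and without the transcript in the prompt. It is not the code used in the course. The function names, the prompt wording, the substring-based grading, and the `toy_model` stand-in (used here instead of a real GPT-4 call) are all assumptions made for illustration, and the toy question is made up rather than taken from the dataset.

```python
from typing import Callable, Optional, Sequence, Tuple


def build_prompt(question: str, transcript: Optional[str] = None) -> str:
    """Return a plain prompt, or one grounded in the transcript if given."""
    if transcript is None:
        return f"Question: {question}\nAnswer:"
    return (
        "Answer the question using only the transcript below.\n\n"
        f"Transcript:\n{transcript}\n\n"
        f"Question: {question}\nAnswer:"
    )


def is_correct(model_answer: str, reference: str) -> bool:
    """Crude grading (an assumption): the reference appears in the answer."""
    return reference.strip().lower() in model_answer.strip().lower()


def evaluate(
    qa_pairs: Sequence[Tuple[str, str]],
    transcript: str,
    ask_model: Callable[[str], str],
) -> Tuple[float, float]:
    """Return (accuracy without transcript, accuracy with transcript)."""
    if not qa_pairs:
        raise ValueError("qa_pairs must not be empty")
    plain = sum(
        is_correct(ask_model(build_prompt(q)), a) for q, a in qa_pairs
    )
    grounded = sum(
        is_correct(ask_model(build_prompt(q, transcript)), a)
        for q, a in qa_pairs
    )
    n = len(qa_pairs)
    return plain / n, grounded / n


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; only 'knows' grounded prompts."""
    if "Transcript:" in prompt:
        return "It was founded in 1999."
    return "I am not sure."


# Made-up toy data, not from the Acquired dataset.
toy_pairs = [("When was the company founded?", "1999")]
toy_transcript = "The company was founded in 1999 by two friends."
scores = evaluate(toy_pairs, toy_transcript, toy_model)
```

To run this against a real model, `ask_model` would be replaced with a function that sends the prompt to the model and returns its text. A full transcript may not fit in a single prompt, so a real setup would likely retrieve only the relevant passages first, and human grading would be more reliable than the substring check used here.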
I want to thank Rain and Eric from my team at Takin.AI (https://takin.ai/) for collecting and cleaning the data, and my students for creating the Q&A file.
PS. The featured image for this post was generated using the HiddenArt tool from Takin.ai.
Originally published at https://harrywang.me on June 15, 2024.