    How to Fetch Thousands of YouTube Transcripts Safely and Efficiently

    July 10, 2025

    A deep dive into building a scalable YouTube transcript fetcher using threading, proxies, realistic headers, retries, and clean Python code.

    Fetching transcripts from YouTube might seem simple—until you try to do it thousands of times in a row.

    If you're building a dataset for machine learning, training an LLM, or just want to collect massive amounts of YouTube content, you need a reliable, efficient, and ban-resistant way to fetch transcripts.

    In this post, I’ll walk you through how I designed and implemented a system that does just that. We'll cover the key Python techniques and tools I used, explain the trade-offs, and show you how to structure your code cleanly using Pydantic types.


    Why This Is Harder Than It Looks

    The YouTube Transcript API (an unofficial wrapper) works great at small scale. But if you try to send too many requests from the same IP in a short amount of time, YouTube may:

    • Temporarily block you (IP ban)
    • Serve you rate-limited or incomplete responses
    • Flag your requests as bot-like behavior

    To fetch thousands of transcripts reliably, we need to:

    • Use realistic browser-like headers
    • Respect timeouts and error types
    • Add retries for recoverable failures (like rate limiting)
    • Use concurrent threads to speed things up
    • Support proxies to avoid getting banned

    Step 1: Mimic Real Browsers with Headers

    Using fake_useragent, we generate a rotating set of headers that make our requests look more like those coming from a real browser.

    from fake_useragent import UserAgent
    import random
    
    ACCEPT_LANGUAGES = ["en-US,en;q=0.9", "fr-FR,fr;q=0.9"]
    REFERERS = ["https://www.youtube.com/", "https://www.google.com/"]
    
    def get_realistic_headers() -> dict:
        ua = UserAgent()
        return {
            "User-Agent": ua.random,
            "Accept": "text/html,application/xhtml+xml,...",
            "Accept-Language": random.choice(ACCEPT_LANGUAGES),
            "Referer": random.choice(REFERERS),
            # Additional modern headers can go here
        }
    

    And we can use these headers with youtube-transcript-api, since it lets us override its default session with our own HTTP client.

    import httpx

    TRANSCRIPT_FETCH_TIMEOUT = 10  # seconds (kept in app.lib.timeout in the full code below)

    headers = get_realistic_headers()
    httpx_client = httpx.Client(timeout=TRANSCRIPT_FETCH_TIMEOUT, headers=headers)

    ytt_api = YouTubeTranscriptApi(http_client=httpx_client)  # initialize the API with our custom client
    

    This is important because YouTube’s systems look at request patterns, headers, and fingerprints. Without proper headers, your script will get flagged quickly.
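
    For reference, the placeholder comment in get_realistic_headers can be filled with a few more browser-typical fields. The exact set below is illustrative rather than exhaustive:

    # Hypothetical extra fields to merge into the headers dict above.
    EXTRA_HEADERS = {
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "same-origin",
    }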


    Step 2: Getting Data from the API

    Here's how we can fetch a YouTube transcript with youtube-transcript-api using the video_id parameter. Note that we call the to_raw_data function to convert the result into plain dictionaries.

    You can find more details here: youtube-transcript-api

    from youtube_transcript_api import YouTubeTranscriptApi
    from youtube_transcript_api._errors import NoTranscriptFound, VideoUnavailable, TranscriptsDisabled
    
    def fetch_transcript_with_snippet(video_id: str) -> dict | None:
        try:
            ytt_api = YouTubeTranscriptApi(http_client=httpx_client)
            # to_raw_data() turns the fetched transcript into a list of plain dicts.
            transcript = ytt_api.fetch(video_id).to_raw_data()

            return {
                "video_id": video_id,
                "transcript": transcript,
            }
        except (NoTranscriptFound, VideoUnavailable, TranscriptsDisabled):
            # Expected failures: no transcript, video unavailable, or transcripts disabled.
            return None
        except Exception as e:
            print(f"⚠️ Unexpected error: {e}")
            return None
    

    Each call returns a dict in the following format:

    {
        "video_id": "1234",
        "transcript": [
            {
                "text": "Hey there",
                "start": 0.0,
                "duration": 1.54
            },
            {
                "text": "how are you",
                "start": 1.54,
                "duration": 4.16
            },
            # ...
        ]
    }
    
    
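    If you only need plain text downstream (for example, to feed an LLM), you can flatten those snippets into a single string. A minimal sketch; the join_transcript_text helper here is my own addition, not part of the library:

    def join_transcript_text(transcript: list[dict]) -> str:
        # Concatenate the "text" field of every snippet into one space-separated string.
        return " ".join(snippet["text"].strip() for snippet in transcript)

    # Usage with the dict returned by fetch_transcript_with_snippet:
    # result = fetch_transcript_with_snippet(video_id)
    # full_text = join_transcript_text(result["transcript"]) if result else ""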

    Step 3: Add Retry Logic for IP Blocks

    Using tenacity, we can automatically retry requests when certain errors occur—like getting blocked by YouTube.

    from tenacity import retry, wait_fixed, stop_after_attempt, retry_if_exception_type
    from youtube_transcript_api._errors import IpBlocked
    
    @retry(retry=retry_if_exception_type(IpBlocked), wait=wait_fixed(1), stop=stop_after_attempt(2))
    def fetch_transcript(video_id: str) -> dict:
        # Automatically retried (up to 2 attempts, 1 second apart) when YouTube blocks our IP.
        api = YouTubeTranscriptApi(http_client=httpx_client)
        transcript = api.fetch(video_id).to_raw_data()
        return {"video_id": video_id, "transcript": transcript}
    

    Tenacity handles retries cleanly without having to write verbose try-except loops.
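
    In practice, a fixed one-second wait can keep hammering an already suspicious IP. Tenacity also supports exponential backoff with jitter, which tends to play nicer with rate limits. A hedged variation on the decorator above; the wait and stop values are just examples:

    from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential
    from youtube_transcript_api._errors import IpBlocked

    @retry(
        retry=retry_if_exception_type(IpBlocked),
        wait=wait_random_exponential(multiplier=1, max=30),  # ~1s, ~2s, ~4s, ... capped at 30s, with jitter
        stop=stop_after_attempt(5),
    )
    def fetch_transcript_with_backoff(video_id: str) -> dict:
        api = YouTubeTranscriptApi(http_client=httpx_client)
        return {"video_id": video_id, "transcript": api.fetch(video_id).to_raw_data()}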


    Step 4: Use Threads + Async to Fetch at Scale

    Before we get into this step, I'm assuming you already have your video_ids available. There are various ways to collect video IDs, but I personally used the YouTube Data API v3, which is the official YouTube API.

    You can enable this API in the Google Cloud Console.

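    As a rough sketch of how that could look with the google-api-python-client package (the API key, channel ID, and helper name are placeholders you'd adapt to your own project):

    from googleapiclient.discovery import build

    def list_channel_video_ids(api_key: str, channel_id: str, max_results: int = 50) -> list[str]:
        # Build a YouTube Data API v3 client and search for videos on one channel.
        youtube = build("youtube", "v3", developerKey=api_key)
        response = youtube.search().list(
            part="id",
            channelId=channel_id,
            type="video",
            maxResults=max_results,
        ).execute()
        return [item["id"]["videoId"] for item in response.get("items", [])]
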
    Or, for now, you can simply use a dummy list of video IDs like this:

    video_ids = ['123', '1234', '12345']
    

    Python’s asyncio.to_thread lets you run synchronous code (like the transcript API) inside threads coordinated by asyncio. This speeds up fetching dramatically, which is essential for channels with hundreds of videos.

    Here's how it looks in Python:

    import asyncio

    async def fetch_all(video_ids: list[str]) -> list:
        async def run(vid: str):
            # Run the blocking fetch in a worker thread so the event loop stays free.
            return await asyncio.to_thread(fetch_transcript, vid)

        tasks = [run(vid) for vid in video_ids]
        results = await asyncio.gather(*tasks)
        return results
    

    This allows you to process hundreds of transcripts simultaneously—without blocking.
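
    One caveat: gathering every video at once can open hundreds of simultaneous connections to YouTube, which works against the goal of looking like a normal user. A minimal sketch of bounding concurrency with an asyncio.Semaphore; the limit of 30 is arbitrary and should be tuned to your proxy setup:

    import asyncio

    async def fetch_all_bounded(video_ids: list[str], limit: int = 30) -> list:
        semaphore = asyncio.Semaphore(limit)  # at most `limit` fetches in flight at a time

        async def run(vid: str):
            async with semaphore:
                return await asyncio.to_thread(fetch_transcript, vid)

        return await asyncio.gather(*(run(vid) for vid in video_ids))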


    Step 5: Avoid Getting Banned with Proxies

    Eventually, even with good headers, your IP will get flagged. That’s why proxies are essential for long-term, high-volume scraping.

    You can plug proxy support into httpx like this:

    httpx_client = httpx.Client(
        proxy="http://username:password@proxy_ip:port",  # older httpx releases call this argument `proxies`
        headers=get_realistic_headers(),
        timeout=10
    )
    

    You can rotate proxies for each request, or use a service like:

    • Bright Data
    • ScraperAPI
    • Smartproxy
    • requests-ip-rotator (for AWS setups)

    This will significantly improve reliability and longevity of your fetcher.
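
    If you manage your own proxy list, one simple approach is to build a fresh client with a randomly chosen proxy per batch of requests. A rough sketch; the PROXIES list is a placeholder you would fill with your own endpoints:

    import random
    import httpx

    # Placeholder proxy URLs; replace with your own pool.
    PROXIES = [
        "http://user:pass@proxy1.example.com:8000",
        "http://user:pass@proxy2.example.com:8000",
    ]

    def make_client() -> httpx.Client:
        # A random proxy plus fresh headers makes consecutive clients look different.
        return httpx.Client(
            proxy=random.choice(PROXIES),
            headers=get_realistic_headers(),
            timeout=10,
        )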


    Step 6: Keep Output Clean with Pydantic

    For reliable data exchange and downstream ML tasks, use Pydantic to define your schema.

    from pydantic import BaseModel
    
    class FetchAndMetaResponse(BaseModel):
        video_id: str
        transcript: list[dict]
    

    Then, wrap each result like this:

    from app.types.youtube import FetchAndMetaResponse # Your folder structure could be different.
    
    results = await asyncio.gather(*tasks)
    return [
        FetchAndMetaResponse(
            video_id=result["video_id"],
            transcript=result["transcript"],
        )
        for result in results if result
    ]
    

    This makes your fetcher’s output robust and compatible with any future APIs or file exports (CSV, JSON, etc.).
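
    For example, since each FetchAndMetaResponse is a Pydantic model, writing the results to a JSON file takes only a few lines. A small sketch; the file path is arbitrary, and model_dump assumes Pydantic v2 (use .dict() on v1):

    import json

    def save_responses(responses: list[FetchAndMetaResponse], path: str = "transcripts.json") -> None:
        # model_dump() turns each Pydantic model back into a plain dict.
        with open(path, "w", encoding="utf-8") as f:
            json.dump([r.model_dump() for r in responses], f, ensure_ascii=False, indent=2)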


    Full Code

    from youtube_transcript_api import YouTubeTranscriptApi
    from youtube_transcript_api._errors import NoTranscriptFound, VideoUnavailable, TranscriptsDisabled, IpBlocked
    from tenacity import retry, wait_fixed, retry_if_exception_type, stop_after_attempt
    from app.types.youtube import FetchAndMetaResponse
    from app.lib.timeout import TRANSCRIPT_FETCH_TIMEOUT
    from app.lib.defenses.headers import get_realistic_headers
    import asyncio
    import httpx

    # Shared HTTP client with realistic headers and a request timeout.
    headers = get_realistic_headers()
    httpx_client = httpx.Client(timeout=TRANSCRIPT_FETCH_TIMEOUT, headers=headers)

    @retry(
        retry=retry_if_exception_type(IpBlocked),
        wait=wait_fixed(1),
        stop=stop_after_attempt(2)
    )
    def fetch_transcript_with_snippet(video_id: str) -> dict | None:
        try:
            ytt_api = YouTubeTranscriptApi(http_client=httpx_client)
            transcript = ytt_api.fetch(video_id).to_raw_data()

            return {
                "video_id": video_id,
                "transcript": transcript,
            }
        except (NoTranscriptFound, VideoUnavailable, TranscriptsDisabled):
            # Expected failures: no transcript, video unavailable, or transcripts disabled.
            return None
        except Exception as e:
            print(f"⚠️ Unexpected error: {e}")
            return None

    async def fetch_all_transcripts_with_metadata(video_ids: list[str]) -> list[FetchAndMetaResponse]:
        async def run_in_thread(vid: str):
            # asyncio.to_thread runs the blocking fetch in a worker thread.
            return await asyncio.to_thread(fetch_transcript_with_snippet, vid)

        tasks = [run_in_thread(vid) for vid in video_ids]
        results = await asyncio.gather(*tasks)

        return [
            FetchAndMetaResponse(
                video_id=result["video_id"],
                transcript=result["transcript"],
            )
            for result in results if result
        ]
    

    Final Thoughts

    This setup has allowed me to fetch tens of thousands of transcripts from YouTube for my AI projects without getting blocked. It’s fast, safe, and extensible.

    To recap:

    • Use headers and proxies to mimic real users
    • Add retry logic for resilience
    • Use threading and async for performance
    • Structure responses using Pydantic

    Let me know if you'd like me to show how to integrate this with FastAPI, cache results, or export to CSV.
