MarkItDown: Everything Becomes Markdown

TL;DR

  • What it solves: The per-format document parsing problem: PDF has one library, Word has another, PowerPoint a third, and each one fails differently on edge cases
  • Why it matters: LLMs parse Markdown natively; every token of raw binary, space-mangled PDF text, or HTML attribute soup you send instead is a token spent on noise rather than reasoning
  • Best for: Python developers building LLM pipelines, RAG ingestion systems, or AI agents that need to consume heterogeneous document collections
  • Main differentiator: One unified md.convert("file.ext") call that handles 20+ formats, preserves document structure, and optionally calls an LLM to describe embedded image content
  • Best use case: Replacing the PyMuPDF + python-pptx + xlrd + boto3 frankenstein stack you quietly assembled six months ago and no longer want to maintain

The pipeline already handled PDFs. Six weeks of pdfminer.six, a custom heading-detection heuristic I was privately embarrassed about, and a table fallback I kept out of the demo. Then the Slack message arrived: “Can it also handle the spreadsheets?”

I said yes because saying no in Slack is harder than it should be.

Three days later, the spreadsheets worked. Two days after that, the request was PowerPoint. I had become the person who maintains a different document library for every format the business happens to use.

The Babel Fish for Documents

MarkItDown is a Python library and CLI from Microsoft’s AutoGen team that converts files to Markdown. PDF, Word, PowerPoint, Excel, HTML, CSV, JSON, XML, EPUB, images, audio, ZIP archives, YouTube URLs — one call, one output format, every format.

One sentence a junior developer could repeat: MarkItDown converts any file to Markdown so LLM pipelines, RAG systems, and AI agents can read it without per-format parsing code.

The design philosophy is stated plainly in the README and worth repeating: this tool produces output for LLMs, not for human document archiving. Headings survive. Tables survive. Links survive. The precise font size of a slide title does not. Think of it as the universal power adapter for documents: wherever the data originated, it comes out the other end as clean Markdown that every major LLM already speaks natively.

The question is whether your specific documents actually hold up under conversion.

Real-World Use Cases

Here is where this tool earns its 103,516 stars. The scenarios below are not toy examples.

RAG ingestion over a mixed-format document collection

You have 200 documents: quarterly reports in PDF, projections in XLSX, strategy decks in PPTX, meeting notes in DOCX. Before MarkItDown, each format needs a different library. After:

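from markitdown import MarkItDown
import glob
import os

md = MarkItDown(enable_plugins=False)
# `index` is your vector store client, defined elsewhere.
for path in glob.glob("docs/**/*", recursive=True):
    if not os.path.isfile(path):
        continue  # the glob also matches directories
    try:
        result = md.convert(path)
    except Exception as exc:
        print(f"skipped {path}: {exc}")  # unsupported or unreadable file
        continue
    index.add(result.text_content, metadata={"source": path})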

Two hundred documents. One API. The index gets filled.

The before/after: Excel without and with MarkItDown

The output quality difference is not subtle. A real multi-column Excel report:

Before, using xlrd directly:

import xlrd
wb = xlrd.open_workbook("report.xls")
sheet = wb.sheet_by_index(0)
rows = [sheet.row_values(i) for i in range(sheet.nrows)]
# Output: [['Quarter', 'Revenue', 'Growth'], ['Q1 2025', 1200000.0, 0.12], ...]
# Table structure is gone. Headers are just another Python list.

After, with MarkItDown:

from markitdown import MarkItDown
md = MarkItDown(enable_plugins=False)
print(md.convert("report.xls").text_content)

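Output (illustrative; number formatting in your output depends on how the cell values are stored in the workbook):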

## Sheet1

| Quarter | Revenue | Growth |
|---------|---------|--------|
| Q1 2025 | $1.2M   | 12%    |
| Q2 2025 | $1.4M   | 16%    |
| Q3 2025 | $1.6M   | 14%    |

That table goes directly into an LLM prompt. No serialization code. No column-alignment logic. No second library to install.

Meeting notes from an MP3

markitdown standup-2026-04-12.mp3 > standup.md

Audio metadata plus a speech-to-text transcript in Markdown. Two seconds.

FastAPI document ingestion endpoint

import io
import os

from fastapi import FastAPI, UploadFile
from markitdown import MarkItDown

app = FastAPI()
md = MarkItDown()

@app.post("/convert")
async def convert_document(file: UploadFile):
    content = await file.read()
    # Pass the extension as a hint; the filename may be missing or have none.
    ext = os.path.splitext(file.filename or "")[1].lower() or None
    result = md.convert_stream(io.BytesIO(content), file_extension=ext)
    return {"markdown": result.text_content}

One endpoint accepts any supported format and returns Markdown. No branching per content type.

AI agent with document-reading capability

An AutoGen agent that calls md.convert(path) as a tool can autonomously read PDFs, spreadsheets, and presentations. That is exactly the use case the AutoGen team built this for.
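
One way to expose conversion as an agent tool is a plain function that returns a string in every case, so the model sees an error message instead of a stack trace. The sketch below is framework-agnostic. `read_document`, `max_chars`, and the injected `converter` argument are illustrative names, not part of MarkItDown or AutoGen; the only assumption is a `convert(path)` method whose result exposes `text_content`, as in the examples above.

```python
from typing import Any


def read_document(path: str, converter: Any, max_chars: int = 20_000) -> str:
    """Convert a file to Markdown for an agent, never raising.

    `converter` is any object with a `convert(path)` method whose result
    has a `text_content` attribute (for example, a MarkItDown instance).
    """
    try:
        text = converter.convert(path).text_content
    except Exception as exc:  # tools should report failures as text
        return f"ERROR: could not convert {path}: {exc}"
    if len(text) > max_chars:
        # Keep the prompt bounded and tell the model that content was cut.
        return text[:max_chars] + "\n\n[truncated]"
    return text
```

With the real library, `converter` would be something like `MarkItDown(enable_plugins=False)`. How the function gets registered as a tool depends on the agent framework you use.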

Which one of these patterns applies to your stack depends on what MarkItDown’s configuration actually unlocks.

How to Use It

Install:

# Full install -- every format
pip install 'markitdown[all]'

# Selective -- only what you need
pip install 'markitdown[pdf,docx,pptx]'

CLI (the fastest path from any file to Markdown):

markitdown report.pdf               # stdout
markitdown report.pdf -o report.md  # explicit output file
cat report.pdf | markitdown         # pipe input

Python API (the standard pattern for pipelines):

from markitdown import MarkItDown

md = MarkItDown(enable_plugins=False)
result = md.convert("report.pdf")
print(result.text_content)

Output:

## Q1 Financial Report

### Executive Summary

Revenue grew 12% year-over-year...

| Metric  | Q1 2025 | Change |
| ------- | ------- | ------ |
| Revenue | $1.2M   | +12%   |
| Growth  | +12%    | --     |

With LLM image descriptions (for diagrams, charts, and slide visuals):

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this image for a technical audience."
)
result = md.convert("presentation.pptx")
# Each visual gets an inline LLM-generated description in the output

Streaming (for file upload endpoints):

import io
result = md.convert_stream(io.BytesIO(file_bytes), file_extension=".pdf")

💡 Tip: Use convert_stream() rather than writing to a temp file when processing uploads. Pass raw bytes as io.BytesIO and specify file_extension explicitly so the converter does not have to guess the format from an absent filename.

Configuration and Customization

| Parameter | Type | When to Use |
|-----------|------|-------------|
| enable_plugins | bool (default False) | Set True to activate plugins like markitdown-ocr. Off by default for security isolation. |
| llm_client | OpenAI-compatible | Pass for LLM image descriptions or Vision-based OCR |
| llm_model | str | Model name, e.g. "gpt-4o" |
| llm_prompt | str | Override the default image description prompt |
| docintel_endpoint | str | Azure Document Intelligence endpoint for high-fidelity scanned PDF conversion |

Install extras by format:

| Extra | Formats | Key Packages |
|-------|---------|--------------|
| [pdf] | .pdf | pdfminer.six, pdfplumber |
| [docx] | .docx | mammoth, lxml |
| [pptx] | .pptx | python-pptx |
| [xlsx] | .xlsx | pandas, openpyxl |
| [audio-transcription] | .wav, .mp3 | pydub, SpeechRecognition |
| [youtube-transcription] | YouTube URLs | youtube-transcript-api |
| [az-doc-intel] | Scanned PDFs | azure-ai-documentintelligence |
| [all] | Everything | Use for prototypes; trim for production |

Built-in (no extra required): HTML, CSV, JSON, XML, EPUB, ZIP, plain text, images (EXIF metadata).
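
To trim `[all]` for production, one approach is to scan the corpus and derive the extras from the extensions actually present. The mapping below is copied from the table above and covers only those rows, so treat it as a starting point, not a complete list. `extras_for` and `pip_spec` are hypothetical helper names.

```python
from pathlib import Path
from typing import Iterable

# Extension -> extra, taken from the table above (not exhaustive).
EXTRA_BY_EXTENSION = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".pptx": "pptx",
    ".xlsx": "xlsx",
    ".wav": "audio-transcription",
    ".mp3": "audio-transcription",
}


def extras_for(paths: Iterable[str]) -> list[str]:
    """Return the sorted extras needed for the given file paths."""
    found = set()
    for p in paths:
        extra = EXTRA_BY_EXTENSION.get(Path(p).suffix.lower())
        if extra:
            found.add(extra)
    return sorted(found)


def pip_spec(paths: Iterable[str]) -> str:
    """Build a requirement string such as 'markitdown[docx,pdf]'."""
    extras = extras_for(paths)
    return f"markitdown[{','.join(extras)}]" if extras else "markitdown"
```

Files that need no extra (HTML, CSV, and the other built-ins) simply contribute nothing, so a corpus of only built-in formats yields plain `markitdown`.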

Knowing what is built-in does not tell you where this tool hits a wall.

Where It Fits (And Where It Doesn’t)

MarkItDown sits at the intake layer: before the LLM, before the RAG chunker, before the AI agent. It replaces the custom per-format conversion code that every team independently assembles and then quietly stops maintaining.
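
Because headings survive conversion, a downstream chunker can split on them instead of on arbitrary character counts. The sketch below is a minimal, assumption-laden example of that idea: `split_by_headings` is a hypothetical helper, it recognizes only `#`-style headings, and it does not special-case heading-like lines inside code blocks.

```python
import re


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into chunks, one per ATX heading (#, ##, ...).

    Text before the first heading is returned under an empty title.
    """
    chunks = []
    title, body = "", []

    def flush() -> None:
        # Emit the current section unless it is completely empty.
        if title or any(line.strip() for line in body):
            chunks.append({"title": title, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading as a title, which can go into the vector store as metadata alongside the source path.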

Works well with:

  • AutoGen agents (same team, natural fit)
  • LangChain pipelines and LlamaIndex RAG
  • OpenAI Assistants and Claude Desktop via markitdown-mcp
  • Any FastAPI or Flask document ingestion endpoint
  • Chroma, Pinecone, and Weaviate downstream of conversion

Doesn’t replace: Pandoc for document-to-document conversion with human-readable fidelity. A dedicated transcription service like Whisper for production audio pipelines. Tesseract for offline OCR without an API key.

The broader picture: There is a growing class of tools that treat Markdown as the lingua franca of both the AI era and modern technical writing. Markdy treats Markdown as an authoring format for interactive documentation. MarkItDown treats it as a universal output format for LLM pipelines. The direction is identical: complex structured data in, clean Markdown out, everything downstream benefits.

Scalability path:

  1. Prototype: pip install 'markitdown[all]', call md.convert() directly
  2. Team service: Wrap in FastAPI with convert_stream(), containerize with the included Dockerfile
  3. Production: Route high-value scanned PDFs through Azure Document Intelligence; expose document-reading to Claude Desktop via markitdown-mcp

Each step of that path runs into different rough edges.

The Rough Edges

The README is honest about this, so I will be too.

PDF quality depends on the PDF type. Text-based PDFs convert well. Scanned documents are a different problem. pdfminer.six reads embedded text; it does not do OCR. For scanned PDFs, use the markitdown-ocr plugin or route through Azure Document Intelligence.
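
A simple routing heuristic is to try the default converter first and fall back only when the result is nearly empty, which is what a scanned PDF with no embedded text tends to produce. The sketch below assumes two converter objects with a `convert(path)` method. `convert_with_fallback` and the `min_chars` threshold of 50 are illustrative choices, not MarkItDown features.

```python
from typing import Any


def convert_with_fallback(path: str, primary: Any, fallback: Any,
                          min_chars: int = 50) -> tuple[str, str]:
    """Return (markdown, route), where route is 'primary' or 'fallback'.

    Both converters need a `convert(path)` method whose result exposes
    `text_content`. The fallback runs only when the primary output has
    fewer than `min_chars` non-whitespace characters.
    """
    text = primary.convert(path).text_content or ""
    if len("".join(text.split())) >= min_chars:
        return text, "primary"
    return fallback.convert(path).text_content or "", "fallback"
```

With the real library, `primary` could be `MarkItDown(enable_plugins=False)` and `fallback` a second instance created with `docintel_endpoint` set to your Azure Document Intelligence resource. The threshold deserves tuning against your own documents, since a short but legitimate text PDF would also trigger the fallback.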

⚠️ Warning: Both markitdown-ocr and LLM image descriptions require a live API key. There is no offline OCR path. If your pipeline needs to run air-gapped or without paid API calls, MarkItDown cannot do OCR without a billing account. Tesseract-based tools are the alternative for that constraint.

Audio transcription is basic. The bundled SpeechRecognition library handles demos. For production audio where accuracy matters, use Whisper or a dedicated transcription API and feed the resulting transcript to MarkItDown afterward.

Still Beta. Development Status 4. Between 0.0.1 and 0.1.0 there was a breaking change: convert_stream() now requires a binary file-like object, not a text stream. If you are upgrading from an older version, grep every convert_stream call and verify open(f, "rb").
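
A small guard can catch text-mode streams before they reach `convert_stream()`. The helper below is hypothetical (`ensure_binary` is not part of MarkItDown) and relies only on the standard `io` class hierarchy, where text streams derive from `io.TextIOBase`.

```python
import io
from typing import BinaryIO


def ensure_binary(stream) -> BinaryIO:
    """Return the stream unchanged, or raise if it is a text stream."""
    if isinstance(stream, io.TextIOBase):
        raise TypeError(
            "convert_stream() needs a binary stream; "
            "open the file with mode 'rb' or wrap bytes in io.BytesIO"
        )
    return stream
```

Wrapping call sites as `md.convert_stream(ensure_binary(f), file_extension=".pdf")` turns a confusing downstream failure into an immediate, readable error during the upgrade.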

YouTube transcription requires published captions. The library fetches the YouTube Transcript API text track. Videos without captions will fail silently or return nothing.

Plugins are disabled by default. enable_plugins=False is intentional. Community plugins get the same runtime access as core code once enabled. Treat enable_plugins=True as a deliberate decision to cross a security boundary, not a convenience flag.

Getting Started

The minimum path to a working result:

python -m venv .venv && source .venv/bin/activate
pip install 'markitdown[pdf,docx,pptx,xlsx]'

Then, in Python:

from markitdown import MarkItDown

md = MarkItDown(enable_plugins=False)
result = md.convert("your-file.pdf")
print(result.text_content)

Point that at a real file, not a toy one. The first time you see a messy multi-section Word document come out as readable Markdown with intact headings, the immediate next thought is almost always: “Where else in this codebase am I still doing this by hand?”

For Claude Desktop, install markitdown-mcp and configure it as an MCP server. Claude Desktop then gains on-demand conversion for any format MarkItDown supports, without a line of integration code.

How It Compares: Alternatives

The README directly names textract as the closest comparable tool. The key distinction is structural: textract extracts text and discards document structure; MarkItDown preserves headings, tables, and links as Markdown. For an LLM that reasons better with structure than flat text blobs, that gap matters.

Pandoc is the gold standard for human-readable document conversion. More formats, better fidelity for output meant for human eyes. But it does not read PDFs, images, or audio as input, and its output is not optimized for LLM ingestion.

unstructured is the enterprise alternative for RAG preprocessing. More mature chunking and partitioning features, heavier install, more complex API, commercial licensing for advanced features. MarkItDown is simpler, fully MIT, and specifically designed for the “one call returns Markdown” use case that covers most pipelines.

FAQ

What file formats does MarkItDown support? PDF, DOCX, PPTX, XLSX, XLS, MSG (Outlook), WAV, MP3, YouTube URLs, HTML, CSV, JSON, XML, EPUB, ZIP, images (EXIF metadata), and plain text. Install markitdown[all] for the full set, or pick specific extras to keep the dependency footprint small. The Configuration section has the per-format breakdown with the exact packages each extra pulls in.

Does MarkItDown require an OpenAI API key? Only for LLM image descriptions in PPTX and image files and for the markitdown-ocr plugin, both of which go through the llm_client you pass in. Azure Document Intelligence needs its own Azure endpoint and credentials, not an OpenAI key. Basic conversion for PDFs, Office docs, HTML, CSV, and most other formats runs with no API key at all.

How does MarkItDown compare to Pandoc for Markdown conversion? Pandoc is better for human-readable fidelity and supports more document format pairs. MarkItDown is better for LLM ingestion: it handles PDFs, images, audio, and YouTube that Pandoc does not touch, and its Markdown output is optimized for token efficiency rather than visual fidelity.

Can I use MarkItDown in a FastAPI service? Yes. Use convert_stream(io.BytesIO(file_bytes), file_extension=".pdf") to process uploaded files without writing to disk. The streaming API is designed exactly for this pattern. See the How to Use It section for the complete endpoint example.

Is MarkItDown production-ready? It is Beta (Development Status 4). Microsoft’s AutoGen team uses it in production AI pipelines, and the MIT license allows unrestricted commercial use. Pin the version in production and review the changelog before any upgrade. The breaking change between 0.0.1 and 0.1.0 confirms that version pinning is not optional.

Final Thoughts

The last time I assembled a multi-format document parser by hand, it took two weeks to cover four formats, had three separate failure modes for PDFs alone, and stopped being maintained the quarter after the engineer who built it moved to another team.

MarkItDown does not solve every document problem. Scanned PDFs need careful routing. Audio transcription is limited for production use. Beta status means you should pin your version and stay close to the changelog. But for the specific problem of collapsing a mixed-format document collection into clean Markdown for an LLM pipeline, the alternative is exactly that multi-library Frankenstein stack, which also has rough edges, also requires maintenance, and also stops working the day the business asks for one more format.

103,516 stars for a Python library that converts files to Markdown. That number makes complete sense the moment you have spent an afternoon explaining to pdfminer why a two-column annual report is not a single continuous stream of words.


microsoft/markitdown · MIT · 103.5k★

Hoang Yell

A software developer and technical storyteller. I spend my time exploring the most interesting open-source repositories on GitHub and presenting them as accessible stories for everyone.