PyMuPDF Layout¶
PyMuPDF Layout is a lightweight layout analysis extension for PyMuPDF that turns PDFs into clean, structured data with minimal setup. It’s fast, accurate, and efficient without any GPU requirement.
It is an optional but recommended addition to the PyMuPDF library, especially when you need to extract structured data more accurately and with richer semantic information.
Installing¶
Install from PyPI with:
pip install pymupdf-layout
Using¶
In a nutshell, PyMuPDF Layout performs the layout detection, while PyMuPDF4LLM provides the API. Together they give us options to extract document content as Markdown, JSON or TXT.
Let’s set up the Python coding environment and open a PDF, then we’ll move on to the semantic data extraction.
Register packages and open a PDF¶
First up let’s import the libraries and open a sample document:
import pymupdf.layout
import pymupdf4llm
doc = pymupdf.open("sample.pdf")
Note that, in the code above, PyMuPDF Layout must be imported as shown and before PyMuPDF4LLM. This activates PyMuPDF’s layout feature and makes it available to PyMuPDF4LLM.
Omitting the first line would run standard PyMuPDF4LLM, without the layout feature!
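One way to guard against the wrong import order is to inspect sys.modules before relying on layout-aware output. The sketch below is only an illustration and is not part of either library: the helper name layout_imported_first is hypothetical, and it assumes that sys.modules keeps modules in the order they were first imported (dictionaries preserve insertion order in Python 3.7 and later). It checks the order only, not whether the layout feature was actually activated.

```python
import sys


def layout_imported_first(modules=None):
    """Return True if pymupdf.layout was imported before pymupdf4llm.

    Hypothetical helper: relies on sys.modules keeping first-import order.
    """
    names = list(sys.modules if modules is None else modules)
    if "pymupdf.layout" not in names:
        return False  # layout was never imported, so it cannot be active
    if "pymupdf4llm" not in names:
        return True  # layout is loaded and pymupdf4llm is not imported yet
    return names.index("pymupdf.layout") < names.index("pymupdf4llm")
```

Calling layout_imported_first() right after the imports gives an early warning if the order was accidentally swapped, for example by an import-sorting tool.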
Extract the structured data¶
We’ve activated the PyMuPDF Layout library and loaded a document; next, let’s extract the structured data. This now works like a supercharged version of standard PyMuPDF4LLM, with Layout working behind the scenes and combining heuristics with machine learning for better extraction results.
Extract as Markdown¶
md = pymupdf4llm.to_markdown(doc)
Extract as JSON¶
json_text = pymupdf4llm.to_json(doc) # named json_text to avoid shadowing the json module
Extract as TXT¶
txt = pymupdf4llm.to_text(doc)
Note
Please refer to the full PyMuPDF4LLM API for more.
Finally we can save the output to an external file as follows:
from pathlib import Path
suffix = ".md" # or ".json" or ".txt"
Path(doc.name).with_suffix(suffix).write_bytes(md.encode())
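The same pattern can be wrapped in a small helper that works for any of the three outputs. This is a minimal sketch using only the standard library; the function name save_output and its parameters are hypothetical, and it assumes the converter returned a str, as the encode() call above implies.

```python
from pathlib import Path


def save_output(source_name, text, suffix=".md"):
    """Write extracted text next to the source file, swapping its suffix."""
    if suffix not in {".md", ".json", ".txt"}:
        raise ValueError(f"unsupported suffix: {suffix!r}")
    target = Path(source_name).with_suffix(suffix)
    # Explicit UTF-8 avoids depending on the platform default encoding.
    target.write_text(text, encoding="utf-8")
    return target
```

For example, save_output(doc.name, md, ".md") would write the Markdown result alongside the input PDF and return the path of the new file.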
Extending Capability¶
Using with Pro¶
PyMuPDF Layout can be combined with PyMuPDF Pro, which extends its capability by allowing Office documents to be provided as input files. All we have to do is add the import for PyMuPDF Pro and unlock it:
import pymupdf.layout
import pymupdf4llm
import pymupdf.pro
pymupdf.pro.unlock()
Now we can happily load Office files and convert them as follows:
md = pymupdf4llm.to_markdown("sample.docx")
OCR support¶
The new layout-sensitive PyMuPDF4LLM version also evaluates whether a page would benefit from applying OCR to it. If its heuristics come to this conclusion, the built-in Tesseract-OCR module is automatically invoked. Its results are then handled like normal page content.
If a page contains (roughly) no text at all, but is covered with images or many character-sized vectors, OpenCV is used to check whether text is likely to be detectable on the page at all. This is done to tell apart image-based text from ordinary pictures (like photographs).
If the page does contain text but too many characters are unreadable (like “�����”), OCR is also executed, but for the affected text areas only – not the full page. This way, we avoid losing already existing text and other content like images and vectors.
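To make the second case concrete, the share of unreadable characters in a text span can be estimated by counting U+FFFD replacement characters. The sketch below only illustrates the idea: the function names and the 0.5 threshold are assumptions chosen for illustration, not the logic or values PyMuPDF4LLM actually uses.

```python
REPLACEMENT_CHAR = "\ufffd"  # the character shown as "�" in extracted text


def unreadable_ratio(text):
    """Fraction of non-whitespace characters that are U+FFFD."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c == REPLACEMENT_CHAR for c in chars) / len(chars)


def needs_ocr(text, threshold=0.5):
    """Hypothetical rule: flag a span when too many characters are unreadable."""
    return unreadable_ratio(text) > threshold
```

Applying such a rule per text area rather than per page mirrors the behavior described above, where only the affected areas are sent to OCR.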
For these heuristics to work, we need both an existing Tesseract installation and OpenCV available in the Python environment. If either is missing, no OCR is attempted at all.
The decision tree for whether OCR is actually used or not depends on the following:
In the PyMuPDF4LLM API you have use_ocr enabled (this is set to True by default)
OpenCV is available in your Python environment
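Before relying on automatic OCR, it can help to check both prerequisites up front. The following is a rough sketch using only the standard library. The helper name ocr_prerequisites is hypothetical, and looking for a tesseract executable on PATH is an assumption about how an installation can be detected, not necessarily the check PyMuPDF4LLM performs. cv2 is the import name of the OpenCV Python bindings.

```python
import importlib.util
import shutil


def ocr_prerequisites():
    """Report whether Tesseract and OpenCV appear to be available."""
    return {
        # Assumption: a usable Tesseract install exposes `tesseract` on PATH.
        "tesseract": shutil.which("tesseract") is not None,
        # find_spec locates the module without importing it.
        "opencv": importlib.util.find_spec("cv2") is not None,
    }
```

If either value is False, the description above suggests that no OCR will be attempted, regardless of the use_ocr setting.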
PyMuPDF Layout and PyMuPDF4LLM parameter caveats¶
If you have imported pymupdf.layout, PyMuPDF4LLM changes its behavior quite significantly in various areas. New methods become available, and some features are no longer supported. Please visit this site for a detailed description of the changes. That website is kept up to date while we continue to work on improvements.
