
pdf-utils

PDF processing skill for PyMuPDF and Tesseract workflows: OCR image-based PDFs, extract arXiv IDs from PDF text/OCR output, and handle scriptable PDF utility tasks when the built-in `pdf` tool is not enough. Use when working with scanned PDFs, OCR, arXiv reference mining, or repeatable local PDF-processing scripts.

Author: admin | Source: ClawHub
Version: 1.0.1
Security check: Passed
Downloads: 134
Favorites: 0


# PDF Utils

Use this skill for **local, scriptable PDF processing**. It is a stable 1.x skill for OCR, arXiv reference mining, and repeatable PyMuPDF workflows.

Prefer the built-in `pdf` tool for AI-style reading, summarization, question-answering, and semantic analysis of PDF content.

## Choose the right tool

- Use the built-in **`pdf` tool** for summary, Q&A, extraction by meaning, or general document understanding.
- Use **`scripts/extract_refs.py`** when the PDF already has extractable text and you need arXiv IDs or batch downloads.
- Use **`scripts/ocr_pdf.py`** when the PDF is scanned/image-based and text extraction is poor or empty.
- Use **`scripts/pdf_ops.py`** for repeatable local PDF operations such as merge, split, and rendering a page to an image.

## Core workflows

### Extract arXiv IDs from a text PDF

Run:

```bash
python3 scripts/extract_refs.py paper.pdf
```

If needed, download the referenced papers:

```bash
python3 scripts/extract_refs.py paper.pdf --download --out ~/papers/
```

### OCR a scanned PDF

Run OCR on all pages:

```bash
python3 scripts/ocr_pdf.py paper.pdf --all
```

To OCR and immediately extract arXiv IDs from the OCR output:

```bash
python3 scripts/ocr_pdf.py paper.pdf --all --extract-refs
```

## Dependencies

Install these before using OCR features:

```bash
brew install tesseract
brew install tesseract-lang
pip3 install pytesseract Pillow pymupdf --break-system-packages
```

## Read more only if needed

- Read `references/usage.md` for CLI examples, programmatic API notes, PDF ops usage, and known limits.
- Read the scripts directly if you need to patch behavior or reuse helper functions.

## Practical guidance

- For very large PDFs, OCR in page ranges or batches instead of all at once.
- For handwritten or low-resolution scans, expect OCR quality to drop.
- If a PDF yields partial references, inspect the reference pages first instead of assuming extraction is complete.
- For merge/split/page rendering, use `scripts/pdf_ops.py` first before writing one-off snippets.
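
The arXiv ID extraction step described in the workflows above can be illustrated with a minimal sketch. This is not the code in `scripts/extract_refs.py`, whose implementation is not shown on this page. The function name `extract_arxiv_ids`, the `strip_version` parameter, and the regular expressions are all assumptions made for illustration. The sketch only accepts IDs that follow an `arXiv:` prefix or an `arxiv.org/abs/` or `arxiv.org/pdf/` URL, which is one way to avoid matching unrelated numbers such as page ranges.

```python
import re
from typing import List

# New-style IDs: YYMM.NNNN or YYMM.NNNNN, optional version (e.g. 1706.03762v5).
NEW_STYLE = r"\d{4}\.\d{4,5}(?:v\d+)?"
# Old-style IDs: archive[.SC]/YYMMNNN, optional version (e.g. hep-th/9901001).
OLD_STYLE = r"[a-z][a-z\-]+(?:\.[A-Z]{2})?/\d{7}(?:v\d+)?"

# Hypothetical pattern: require an "arXiv:" prefix or an arxiv.org URL so that
# bare numbers in the text are not mistaken for IDs. The trailing (?!\d)
# rejects candidates that are followed by extra digits.
ARXIV_RE = re.compile(
    rf"(?:arxiv\s*:\s*|arxiv\.org/(?:abs|pdf)/)({NEW_STYLE}|{OLD_STYLE})(?!\d)",
    re.IGNORECASE,
)


def extract_arxiv_ids(text: str, strip_version: bool = True) -> List[str]:
    """Return unique arXiv IDs found in text, in order of first appearance."""
    seen = {}  # dict preserves insertion order, so it doubles as an ordered set
    for match in ARXIV_RE.finditer(text):
        arxiv_id = match.group(1)
        if strip_version:
            # Drop a trailing version suffix such as "v5".
            arxiv_id = re.sub(r"v\d+$", "", arxiv_id)
        seen.setdefault(arxiv_id, None)
    return list(seen)


if __name__ == "__main__":
    sample = "See arXiv:1706.03762v5 and https://arxiv.org/abs/1810.04805."
    print(extract_arxiv_ids(sample))
```

The input string could come from any text source, for example text extracted from a PDF or OCR output. OCR text often splits or distorts identifiers (line breaks, `O` read for `0`), so a pattern like this one would likely miss some references in scanned documents, which is consistent with the advice above to inspect the reference pages when extraction looks partial.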

Tags

skill ai

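Install via chat

This skill can be installed via chat on the following platforms:

OpenClaw WorkBuddy QClaw Kimi Claude

Option 1: Install SkillHub and the skill

Help me install SkillHub and the pdf-utils-1776101110 skill

Option 2: Set SkillHub as the preferred skill installation source

Set SkillHub as my preferred skill installation source, then help me install the pdf-utils-1776101110 skill

Install via command line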

skillhub install pdf-utils-1776101110

Download Zip package

Download pdf-utils v1.0.1

File size: 12.73 KB | Released: 2026-4-14 12:44

v1.0.1 (latest) 2026-4-14 12:44
1.0.1: clean patch release for metadata and publishing hygiene; remove artifact leakage from release flow; keep stable 1.0 test baseline.
