processing-documentslisted
Install: claude install-skill Open330/agt
# Document Processor
Analyze, summarize, and convert office documents.
## Supported Formats
| Format | Read | Write | Tools |
|--------|------|-------|-------|
| PDF | ✅ | ✅ | pdfplumber, pypdf |
| DOCX | ✅ | ✅ | python-docx |
| XLSX | ✅ | ✅ | openpyxl |
| PPTX | ✅ | ✅ | python-pptx |
| HWP | ✅ | ❌ | python-hwp, olefile |
| HWPX | ✅ | ❌ | zipfile, xml.etree |
## Quick Reference
### PDF Text Extraction
```python
import pdfplumber
with pdfplumber.open("doc.pdf") as pdf:
text = "\n".join(p.extract_text() for p in pdf.pages)
```
### Excel Reading
```python
import openpyxl
wb = openpyxl.load_workbook("data.xlsx")
ws = wb.active
data = [[cell.value for cell in row] for row in ws.iter_rows()]
```
### Word Document
```python
from docx import Document
doc = Document("report.docx")
text = "\n".join(p.text for p in doc.paragraphs)
```
### HWP Document (한글)
```python
# Using pyhwp (hwp5) - primary library
from hwp5.xmlmodel import Hwp5File
hwp = Hwp5File("document.hwp")
for section in hwp.bodytext:
# Process paragraphs in each section
pass
```
```python
# Using olefile - low-level OLE access fallback
import olefile
ole = olefile.OleFileIO("document.hwp")
# HWP stores text in "BodyText/Section0", "BodyText/Section1", etc.
streams = [s for s in ole.listdir() if s[0] == "BodyText"]
for stream_path in streams:
data = ole.openstream(stream_path).read()
# HWP uses UTF-16LE encoding for text content
text = data.decode("utf-16-le", errors="ignore")
```
###