HWP (한글)
HWP/HWPX Korean document create, read, edit via OfficeCLI
Default Active: Yes
Binary HWP: Experimental
This skill handles .hwp and .hwpx files exclusively. If the task involves DOCX, PPTX, XLSX, or PDF, it will not activate; use the corresponding format-specific skill instead.
Quick Reference
Complete task-to-command lookup. Each row shows whether the operation is supported and which command to use.
| Task | Status | Command |
|---|---|---|
| Format like existing .hwpx | Yes | cp source.hwpx target.hwpx && officecli open target.hwpx |
| Template-based create | Yes | python3 scripts/build_hwpx.py --template {base|gonmun|minutes|proposal|report} |
| Create new .hwpx | Yes | officecli create file.hwpx |
| Create from Markdown | Yes | officecli create file.hwpx --from-markdown input.md |
| Read / analyze .hwpx | Yes | view text|annotated|outline|stats|html|markdown|tables|forms|objects |
| Edit existing .hwpx | Yes | set, add, remove, move, swap |
| Label-based fill | Yes | set /table/fill --prop 'fill:label=val' |
| Form recognize | Yes | view forms --auto |
| Table map | Yes | view tables |
| Markdown export | Yes | view markdown |
| Equation (수식) | Yes | add --type equation --prop 'script={1 over 2}' |
| Object finder | Yes | view objects |
| Query (expanded) | Yes | query 'tc[text~=홍길동]', :has(), > combinator |
| Template merge | Yes | merge template.hwpx out.hwpx --data '{"key":"val"}' |
| Swap elements | Yes | swap file.hwpx '/p[1]' '/p[2]' |
| Column break | Yes | add --type columnbreak --prop cols=2 |
| Image / floating | Yes | add --type picture --prop anchor=page --prop halign=center |
| Compare documents | Yes | compare a.hwpx b.hwpx (LCS diff + table compare) |
| Security validation | Yes | ZIP bomb, path traversal, symlink, XXE defense |
| Broken ZIP recovery | Yes | Corrupted HWPX auto-recovery via Local File Header scan |
| HTML preview | Yes | view html --browser |
| Watch live preview | Yes | watch file.hwpx |
| Validate .hwpx | Yes | validate (9-level check) |
| Raw XML | Yes | raw, raw-set |
| Watermark (image) | Yes | add --type watermark --prop src=img.png |
| Pattern-match editing | Python L4 | scripts/hwpx_cli.py open → pattern edit XML → save |
| Visual QA | Python L3 | scripts/contact_sheet.py + subagent review |
| New form field creation | Blocked | Source prototype exists; Hancom verification not closed |
| Create new .hwp (binary) | Experimental | officecli create file.hwp --json |
| Read/export .hwp (binary) | Experimental | officecli view file.hwp text --json + svg/png/pdf/markdown |
| Edit .hwp text/fields | Experimental | officecli hwp --json recipes |
| Export HWPX to .hwp | Experimental | officecli set input.hwpx /save-as-hwp --prop output=out.hwp --json |
| Convert .hwp to .hwpx | Fallback | scripts/hwp_convert.py IN.hwp OUT.hwpx |
"~해줌" Usage Examples
Natural Korean prompts that trigger this skill. Just describe what you need.
- Starts from --template gonmun, then uses /table/fill to populate label-value fields like 문서번호, 수신, 참조, 제목.
- Runs view tables, view forms --auto, and view stats to provide a structural breakdown of the document.
- Runs officecli set form.hwpx / --prop 'fill:성명=홍길동' --prop 'fill:연락처=010-1234-5678'
- Starts from --template minutes, then populates structured fields for date, location, attendees, and agenda items.
- Checks officecli hwp doctor --json for native rhwp support. If ready, uses officecli view file.hwp text --json. Otherwise falls back to scripts/hwp_reader.py.
Editing Escalation Ladder
When the primary tool cannot handle the job, the skill escalates through four levels. Lower levels are preferred; higher levels are used only when necessary.
| Level | When | Tool |
|---|---|---|
| L1 OfficeCLI high-level | Typical add/set/remove, label-fill, view modes | officecli add/set/remove/query/view/merge |
| L2 OfficeCLI raw/raw-set | Direct section0.xml / header.xml tweaks | officecli raw FILE /Contents/section0.xml |
| L3 Python script | Bulk find/replace, template assembly, pattern-match | python3 scripts/hwpx_cli.py ... or scripts/build_hwpx.py |
| L4 Unpack → edit XML → repack | KICE exams, regulations, multi-file XML edit | hwpx_cli.py open → edit work/Contents/*.xml → strip lineseg → save |
Escalation Signals
- OfficeCLI cannot add custom style → L2 (raw-set header.xml) + read reference/header-xml-guide.md
- Custom template overlay → L3 (scripts/build_hwpx.py) + read reference/style_id_maps.md
- HWP binary input → check officecli hwp doctor --json first; native rhwp when ready, else L3
- Multi-file pattern match (exams, regulations) → L4
- Style ID lookup → read reference/style_id_maps.md first
Core Workflows
Create & Import
# Empty document
officecli create doc.hwpx
# From Markdown (JUSTIFY alignment by default)
officecli create doc.hwpx --from-markdown input.md
# Left-aligned Markdown import
officecli create doc.hwpx --from-markdown input.md --align left
# Binary HWP via rhwp bridge
officecli create file.hwp --json
# Template merge with data
officecli merge template.hwpx out.hwpx --data '{"이름":"홍길동"}'
officecli merge template.hwpx out.hwpx --data data.json
# Template assembly (recommended for structured docs)
python3 scripts/build_hwpx.py --template gonmun --output gonmun.hwpx
View Modes
officecli view doc.hwpx text # line-numbered text
officecli view doc.hwpx annotated # path + style detail
officecli view doc.hwpx outline # headings only
officecli view doc.hwpx stats # document statistics
officecli view doc.hwpx html --browser # A4 HTML preview
officecli view doc.hwpx markdown # GFM markdown export
officecli view doc.hwpx tables # table 2D grid + label map
officecli view doc.hwpx forms --auto # CLICK_HERE + label-value auto-detect
officecli view doc.hwpx forms --auto --json # JSON for AI pipeline
officecli view doc.hwpx objects # picture/field/bookmark/equation list
officecli view doc.hwpx objects --object-type field # filter by type
officecli view doc.hwpx styles # charPr/paraPr styles
officecli view doc.hwpx issues # 9-level validation issues
Running officecli view file.hwpx without a mode argument is an error. You must always specify the mode: text, tables, markdown, etc.
Edit Operations
# Add paragraph with text and font size
officecli add doc.hwpx /section[1] --type paragraph --prop text="content" --prop fontsize=11
# Add table
officecli add doc.hwpx /section[1] --type table --prop rows=3 --prop cols=4
# Set properties (bold, alignment)
officecli set doc.hwpx '/section[1]/p[1]' --prop bold=true --prop align=CENTER
# Find and replace text
officecli set doc.hwpx / --prop find="old" --prop replace="new"
# Remove element
officecli remove doc.hwpx /section[1]/p[3]
# Swap two elements
officecli swap doc.hwpx '/p[1]' '/p[2]'
Label Fill (Table Auto-Fill)
# Fill by label (fill: prefix)
officecli set doc.hwpx / --prop 'fill:대표자=홍길동' --prop 'fill:연락처=010-1234'
# Directional fill: right (default), down, left, up
officecli set doc.hwpx / --prop 'fill:주소>down=서울시'
# Shorthand (fill: prefix optional with /table/fill path)
officecli set doc.hwpx /table/fill --prop '이름=김서준'
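When many labels need filling, the --prop arguments can be generated instead of typed by hand. The sketch below is a hypothetical helper (build_fill_args is not part of OfficeCLI); it only assembles an argument list in the fill:label=value and fill:label>direction=value forms shown above and does not run anything.

```python
from typing import Mapping, Optional

# Directions listed in this section; "right" is the documented default.
VALID_DIRECTIONS = {"right", "down", "left", "up"}


def build_fill_args(path: str, values: Mapping[str, str],
                    directions: Optional[Mapping[str, str]] = None) -> list[str]:
    """Build an argv list for `officecli set FILE / --prop 'fill:...'`.

    Hypothetical helper: it formats arguments only, nothing is executed.
    """
    directions = directions or {}
    argv = ["officecli", "set", path, "/"]
    for label, value in values.items():
        direction = directions.get(label)
        if direction is None:
            prop = f"fill:{label}={value}"
        else:
            if direction not in VALID_DIRECTIONS:
                raise ValueError(f"unknown direction: {direction}")
            prop = f"fill:{label}>{direction}={value}"
        argv += ["--prop", prop]
    return argv


args = build_fill_args(
    "doc.hwpx",
    {"대표자": "홍길동", "주소": "서울시"},
    directions={"주소": "down"},
)
print(args)
```

Passing the list to subprocess.run(args, check=True) avoids shell quoting problems with Korean labels and the > character.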
Query (Extended Syntax)
officecli query doc.hwpx 'p' # all paragraphs
officecli query doc.hwpx 'tc[text~=홍길동]' # cell text search
officecli query doc.hwpx 'run[bold=true]' # bold runs
officecli query doc.hwpx 'p:has(tbl)' # paragraphs containing tables
officecli query doc.hwpx 'tbl > tr > tc[colSpan!=1]' # merged cells
officecli query doc.hwpx 'run[fontsize>=20]' # 20pt+ font
officecli query doc.hwpx 'p[heading=1]' # heading 1
Operators: =, !=, ~= (contains), >=, <=
Pseudo-selectors: :empty, :contains(text), :has(child), :first, :last
Virtual attributes: text, bold, italic, fontsize, colSpan, rowSpan, heading
Resident Mode (Live Connection)
officecli open doc.hwpx # returns IMMEDIATELY; daemon in bg
officecli view text # view without re-opening
officecli set '/p[1]' --prop bold=true
officecli close # close session
Do not run officecli open as a background shell job. It returns immediately and the daemon lives in the background automatically. Running it as a monitored background shell creates zombies and file locks.
Batch Mode
officecli batch doc.hwpx <<'EOF'
view text
view stats
view forms --auto
EOF
Compare
officecli compare a.hwpx b.hwpx # text diff (default)
officecli compare a.hwpx b.hwpx --mode outline # heading diff
officecli compare a.hwpx b.hwpx --mode table --json # table diff JSON
Alignment uses an LCS (longest common subsequence) dynamic-programming table, with a greedy fallback above 10M cells. Table similarity: dimension weight 0.3 + content weight 0.7. Page range filtering: --pages "1-3,5".
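To make the comparison terms concrete, here is a minimal sketch of an LCS-based line diff and a 0.3/0.7 weighted table score. It is not OfficeCLI's implementation: the way the dimension and content scores are computed is an assumption chosen for illustration, and the greedy fallback for very large inputs is left out.

```python
def lcs_table(a, b):
    """DP table where t[i][j] is the LCS length of a[i:] and b[j:]."""
    t = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                t[i][j] = t[i + 1][j + 1] + 1
            else:
                t[i][j] = max(t[i + 1][j], t[i][j + 1])
    return t


def diff_lines(a, b):
    """Return (op, line) pairs: ' ' unchanged, '-' only in a, '+' only in b."""
    t = lcs_table(a, b)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append((" ", a[i]))
            i += 1
            j += 1
        elif t[i + 1][j] >= t[i][j + 1]:
            out.append(("-", a[i]))
            i += 1
        else:
            out.append(("+", b[j]))
            j += 1
    out += [("-", x) for x in a[i:]]
    out += [("+", x) for x in b[j:]]
    return out


def table_similarity(t1, t2):
    """0.3 * dimension score + 0.7 * content score (both scores are assumed)."""
    def ratio(x, y):
        return 1.0 if x == y == 0 else min(x, y) / max(x, y)

    rows1, cols1 = len(t1), max((len(r) for r in t1), default=0)
    rows2, cols2 = len(t2), max((len(r) for r in t2), default=0)
    dim = (ratio(rows1, rows2) + ratio(cols1, cols2)) / 2
    cells1 = [c for r in t1 for c in r]
    cells2 = [c for r in t2 for c in r]
    longest = max(len(cells1), len(cells2))
    content = 1.0 if longest == 0 else lcs_table(cells1, cells2)[0][0] / longest
    return 0.3 * dim + 0.7 * content


print(diff_lines(["a", "b", "c"], ["a", "c", "d"]))
```

The DP table needs len(a) x len(b) cells, which is why a cheaper fallback is useful for very large documents.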
Image & Watermark
# Inline image
officecli add doc.hwpx /section[1] --type picture --prop path=/path/to/image.png
# Page-centered floating image
officecli add doc.hwpx /section[1] --type picture \
--prop path=/path/to/image.png \
--prop anchor=page --prop halign=center --prop valign=middle
# Watermark (opaque RGB PNG recommended)
officecli add doc.hwpx /section[1] --type watermark \
--prop src=/path/to/watermark.png --prop bright=0 --prop contrast=0
Watch & HTML Preview
officecli watch doc.hwpx # auto-refresh HTML on file change
officecli unwatch doc.hwpx # stop
officecli view doc.hwpx html --browser # one-shot A4 preview
Template Assembly System
HWP uses a unique base + overlay template system. Most HWPX creation for 공문/보고서/회의록/제안서 should use this instead of officecli create blank.
Available Templates
| Template | Purpose | Key Fields |
|---|---|---|
| base | Empty HWPX skeleton | mimetype, META-INF, empty header/section |
| gonmun | Official letter (공문) | 문서번호, 수신, 참조, 제목, 본문 |
| minutes | Meeting minutes (회의록) | 일시, 장소, 참석자, 안건, 결정사항 |
| proposal | Proposal (제안서) | 제안개요, 배경, 내용, 기대효과 |
| report | Report (보고서) | 요약, 현황, 분석, 제언 |
Method 1: build_hwpx.py (Recommended)
python3 scripts/build_hwpx.py --template report --output Q4Report.hwpx
# Then edit content with officecli
officecli open Q4Report.hwpx
officecli set Q4Report.hwpx /table/fill --prop '제목=2026 Q4 보고'
# ... continue editing ...
officecli close Q4Report.hwpx
Method 2: Manual Overlay (for Customization)
# 1. Copy base skeleton
cp -r templates/base/ work/
# 2. Overlay domain-specific styles
cp -r templates/gonmun/* work/Contents/
# 3. Edit header.xml and section0.xml as needed
# Reference: reference/header-xml-guide.md, reference/section0-xml-guide.md
# Style IDs: reference/style_id_maps.md
# 4. Repack as HWPX (ZIP with strip + minify)
python3 scripts/ooxml/pack.py work/ out.hwpx
# 5. Validate
officecli validate out.hwpx
Reference-Based Editing
When the user says "format like X.hwpx", "공문 양식처럼", "기존 보고서 스타일", or provides a source file — start from the source, do not rebuild from scratch.
Workflow
- Copy the source: cp source.hwpx target.hwpx (inherits header.xml styles, section0.xml structure, and META-INF)
- Open: officecli open target.hwpx (daemon returns immediately; do NOT run as background)
- Remove body paragraphs only; keep header.xml (charPr/paraPr/borderFill), META-INF, settings
- Add new content using existing styleidref values; they auto-apply
header.xml holds all style definitions (charPr, paraPr, borderFill, listItems). Rebuilding from scratch breaks styleidref cross-references, loses consistent visual conventions, fails validation, and takes 10x longer.
Template Priority
- User-provided source file: first-class template
- tests/fixtures/agentic/*.hwpx: realistic samples (gonmun, report, minutes with tables)
- templates/{base,gonmun,minutes,proposal,report}/: Template Assembly system
- officecli create blank: only when nothing else applies
Form Recognition & Fill
4-Strategy Recognition
| # | Strategy | Description |
|---|---|---|
| 1 | Adjacent cell label-value | Table label → value detection (default) |
| 2 | Header + data rows | Column-header recognition |
| 3 | In-cell patterns | ☐ checkbox, keyword( ) paren-blank, (label: ) annotation |
| 4 | KV table detection | 16 Korean keywords trigger auto-detection |
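As an illustration of strategy 3, the in-cell patterns can be approximated with regular expressions. This is a sketch under assumptions, not the recognizer OfficeCLI ships: the three patterns and the detect_in_cell_fields name are hypothetical, and real cells may need looser matching.

```python
import re

# "☐ option": an empty checkbox followed by its option text.
CHECKBOX = re.compile(r"☐\s*([^\s☐☑()]+)")
# "keyword(   )": a keyword followed by parentheses holding only whitespace.
PAREN_BLANK = re.compile(r"([^\s()]+)\(\s+\)")
# "(label: )": a parenthesized label with nothing after the colon.
ANNOTATION = re.compile(r"\(([^():]+):\s*\)")


def detect_in_cell_fields(text):
    """Return (kind, label) pairs for fillable patterns found in one cell's text."""
    fields = []
    fields += [("checkbox", m.group(1)) for m in CHECKBOX.finditer(text)]
    fields += [("paren_blank", m.group(1)) for m in PAREN_BLANK.finditer(text)]
    fields += [("annotation", m.group(1).strip()) for m in ANNOTATION.finditer(text)]
    return fields


print(detect_in_cell_fields("☐ 남 ☐ 여"))
print(detect_in_cell_fields("기타(   )"))
print(detect_in_cell_fields("(전화: )"))
```

The patterns assume the whole cell text is available as one string; when a cell is split across several text nodes, join them before matching.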
3-Phase Fill Pipeline
- In-cell patterns: checkbox ☐→☑, paren-blank fill, annotation fill
- Table label-value: exact + prefix 60% matching, 4-directional (right/down/left/up)
- Inline paragraph: regex lookbehind for "label: value" outside tables
AI Form Fill Workflow
# Step 1: Recognize all fields
officecli view form.hwpx forms --auto --json > fields.json
# Step 2: AI maps label -> value (your logic here)
# Step 3: Fill matched fields
officecli set form.hwpx /table/fill --prop '성 명=홍길동'
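For the mapping step, the table label-value phase of the fill pipeline can be sketched as below. How OfficeCLI defines "prefix 60%" is not documented here, so this sketch assumes one reading: after whitespace is removed, one string must be a prefix of the other and the shorter must cover at least 60% of the longer. label_matches and find_value_cell are hypothetical names, and merged cells are ignored.

```python
import re


def normalize(text: str) -> str:
    """Drop all whitespace so '성 명' and '성명' compare equal."""
    return re.sub(r"\s+", "", text)


def label_matches(cell_text: str, label: str, threshold: float = 0.6) -> bool:
    """Exact match after normalization, else an assumed 60% prefix rule."""
    cell, want = normalize(cell_text), normalize(label)
    if not cell or not want:
        return False
    if cell == want:
        return True
    shorter, longer = sorted((cell, want), key=len)
    return longer.startswith(shorter) and len(shorter) / len(longer) >= threshold


def find_value_cell(grid, label, direction="right"):
    """Return (row, col) of the cell next to the first usable label cell, or None."""
    step = {"right": (0, 1), "down": (1, 0), "left": (0, -1), "up": (-1, 0)}[direction]
    for r, row in enumerate(grid):
        for c, text in enumerate(row):
            if label_matches(text, label):
                rr, cc = r + step[0], c + step[1]
                if 0 <= rr < len(grid) and 0 <= cc < len(grid[rr]):
                    return rr, cc
    return None


grid = [["성 명", "", "연락처", ""], ["주소", ""]]
print(find_value_cell(grid, "성명"))          # value cell to the right of the label
print(find_value_cell(grid, "연락처", "left"))
```

Normalizing whitespace first is what lets a label such as '성 명' in the form match the key 성명 in the data.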
Binary HWP Support (Experimental)
Native binary .hwp operations are powered by the rhwp bridge. All operations are capability-gated — always check readiness before assuming support.
Discovery Commands
# Always run these first
officecli hwp doctor --json # runtime readiness check
officecli capabilities --json # full capability matrix
officecli hwp --json # current recipes and policies
Supported Operations (When Ready)
# Create
officecli create file.hwp --json
# View / Export
officecli view file.hwp text --json
officecli view file.hwp svg --page 1 --json
officecli view file.hwp png --page 1 --out /tmp/hwp-png --json
officecli view file.hwp pdf --page 1 --out out.pdf --json
officecli view file.hwp markdown --json
officecli view file.hwp thumbnail --out thumb.png --json
officecli view file.hwp info --json
officecli view file.hwp tables --section 0 --json
# Edit fields
officecli set file.hwp /field --prop name=회사명 --prop value=리지 --prop output=out.hwp --json
# Edit text
officecli set file.hwp /text --prop find=마케팅 --prop value=브릿지 --prop output=out.hwp --json
# Insert text
officecli add file.hwp /text --type paragraph --prop value=새본문 --prop output=out.hwp --json
# Edit table cell
officecli set file.hwp /table/cell --prop section=0 --prop parent-para=3 \
--prop control=0 --prop cell=0 --prop value=오피스셀 --prop output=out.hwp --json
# Native operations (escape hatch)
officecli view file.hwp native --op get-style-list --json
officecli set file.hwp /native-op --prop op=split-paragraph --prop output=out.hwp --json
# HWPX to HWP export
officecli set input.hwpx /save-as-hwp --prop output=out.hwp --json
For .hwp mutation, write to a new file: --prop output=out.hwp. Use in-place mode only for /text replacement, only when explicitly requested, and only after confirming safeInPlace.ready=true. In-place mode must include --in-place --backup --verify.
Equation Handling (수식)
HWPX equations use Hancom's proprietary script language. This is NOT MathML, NOT LaTeX, NOT OMML.
| Script | Result |
|---|---|
| {1 over 2} | 1/2 (fraction) |
| sqrt{x} | Square root of x |
| x^2, x_i | Superscript, subscript |
| int _0 ^1 f(x)dx | Definite integral |
| sum _{i=1} ^n | Sigma summation |
| lim _{x->0} | Limit |
| matrix{a&b # c&d} | 2x2 matrix |
# Create equation
officecli add doc.hwpx /section --type equation --prop 'script=x^2 + y^2 = r^2'
# View all equations
officecli view doc.hwpx objects --object-type equation
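When equations are generated from data, small string helpers keep the script syntax consistent. The helpers below are hypothetical and cover only the forms listed in the table above; they make no claim about the rest of Hancom's script grammar.

```python
def frac(num: str, den: str) -> str:
    """Fraction: {num over den}."""
    return f"{{{num} over {den}}}"


def sqrt(arg: str) -> str:
    """Square root: sqrt{arg}."""
    return f"sqrt{{{arg}}}"


def integral(lower: str, upper: str, body: str) -> str:
    """Definite integral: int _lower ^upper body."""
    return f"int _{lower} ^{upper} {body}"


def summation(start: str, end: str) -> str:
    """Sigma summation: sum _{start} ^end."""
    return f"sum _{{{start}}} ^{end}"


def matrix(rows: list[list[str]]) -> str:
    """Matrix: cells joined with '&', rows joined with ' # '."""
    return "matrix{" + " # ".join("&".join(row) for row in rows) + "}"


def script_prop(script: str) -> str:
    """Value for `--prop` when adding an equation, as in the command above."""
    return f"script={script}"


print(script_prop(frac("1", "2")))
print(matrix([["a", "b"], ["c", "d"]]))
```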
<hp:script> text, not the binary payload. Math exam docs (KICE) require <hp:equation> for every expression. Never use plain text for math.
Korean Document Design Principles
Government Form Aesthetics (한국 공공양식 미감)
- Tables are the backbone: Korean forms are table-driven. Every label-value pair lives in a precisely merged cell grid. Preserve the grid structure exactly.
- Heading hierarchy: 제1조 (Articles) > 제1항 (Clauses) > 제1호 (Items). Use styleidref for outline levels.
- Fixed margins: Government forms use standard A4 margins (top/bottom ~15mm, left/right ~20mm). Do not alter margins on existing documents.
- Alignment: Body text is JUSTIFY (양쪽 정렬) by default. Headings may be CENTER. Never use LEFT for body text in formal documents.
Uniform Spacing Detection (균등분할)
Korean forms often use uniform character spacing for names in cells: "홍 길 동" (spaces between each character). This is a display convention, not data.
- Reading: strip uniform spaces to get the actual value ("홍길동")
- Writing: if the template cell uses uniform spacing, insert spaces to match (2-char: "이 준", 3-char: "홍 길 동", 4-char: "남궁민수")
- Detection regex: ^(\S)\s(\S)\s(\S)$ etc. (single-char groups separated by 1 space)
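The reading and writing rules above can be sketched in a few lines. This is an illustrative generalization of the detection regex to any number of characters, not the skill's own code; is_uniform, strip_uniform, and apply_uniform are hypothetical names.

```python
import re

# Single characters separated by the same whitespace gap, e.g. "홍 길 동".
# The backreference \1 forces every gap to equal the first one.
UNIFORM = re.compile(r"\S(\s+)\S(?:\1\S)*")


def is_uniform(text: str) -> bool:
    """True when the text is single characters with identical gaps between them."""
    return UNIFORM.fullmatch(text) is not None


def strip_uniform(text: str) -> str:
    """Reading: return the real value if the text is uniformly spaced, else unchanged."""
    return re.sub(r"\s+", "", text) if is_uniform(text) else text


def apply_uniform(value: str, gap: str = " ") -> str:
    """Writing: insert the gap between every character of the value."""
    return gap.join(value)


print(strip_uniform("홍 길 동"))
print(strip_uniform("서울시 강남구"))  # ordinary spacing is left alone
print(apply_uniform("홍길동"))
```

Short strings such as "1 2" also match this pattern, so it is safest to apply it only to cells already known to hold names.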
Document Type Classification
| Type | Key Signals | Example |
|---|---|---|
| exam | equation 10+, rect objects | KICE 수능/모의고사 시험지 |
| form | table 3+, checkboxes (☐/■) | 대학 신청서, 정부 양식 |
| regulation | ○ bullets 10+, 별첨/조항 refs, table 10+ | 운영지침, 내규, 시행세칙 |
| report | long text, few tables | 보고서, 논문 |
| mixed | none of above | 사업계획서 |
Mandatory Verification
# 1. Structural validation (MUST pass)
officecli validate output.hwpx
# 2. PDF visual verification (MUST check)
soffice --headless --convert-to pdf --outdir /tmp output.hwpx
# Verify: table positions, guide text removed, checkboxes correct,
# merged cell text in correct row, numbers not corrupted
# 3. Visual QA via subagent (use reference/visual_qa_prompt.md)
python3 scripts/contact_sheet.py /tmp/output.pdf sheet.png
# 4. If Hancom Office available, also open .hwpx directly
Pre-Delivery Checklist
- officecli validate passes (0 errors)
- soffice --headless --convert-to pdf → visual check
- Table cells in correct positions (cellAddr mapping)
- Guide text (※, 예시) fully removed
- Checkboxes ☐/■ in intended cells only
- Merged cell text in correct row
- If Hancom available, open .hwpx directly
Security
| Check | Limits |
|---|---|
| ZIP bomb | 1000 entries, 200 MB, 100:1 ratio |
| Path traversal | null byte, .., absolute path, drive letter, symlink |
| XXE | DtdProcessing.Prohibit |
| Table size | 200 cols x 10,000 rows |
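The ZIP-level limits in the table can be checked with the standard zipfile module before a file is handed to any other tool. This sketch is not OfficeCLI's validator. It assumes the 200 MB limit applies to total uncompressed size and the 100:1 ratio to each entry, and it does not cover the XXE or table-size rows.

```python
import io
import zipfile

MAX_ENTRIES = 1000
MAX_TOTAL_BYTES = 200 * 1024 * 1024  # assumed: total uncompressed size
MAX_RATIO = 100                      # assumed: per-entry uncompressed/compressed


def unsafe_name(name: str) -> bool:
    """Flag null bytes, '..' segments, absolute paths, and drive letters."""
    if "\x00" in name:
        return True
    normalized = name.replace("\\", "/")
    if normalized.startswith("/"):
        return True
    if len(normalized) >= 2 and normalized[1] == ":" and normalized[0].isalpha():
        return True
    return ".." in normalized.split("/")


def is_symlink(info: zipfile.ZipInfo) -> bool:
    """Unix symlinks carry file type 0o120000 in the upper 16 bits of external_attr."""
    return (info.external_attr >> 16) & 0o170000 == 0o120000


def check_archive(data: bytes) -> list[str]:
    """Return a list of problems found in the archive; empty means none found."""
    problems = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        infos = zf.infolist()
        if len(infos) > MAX_ENTRIES:
            problems.append("too many entries")
        if sum(i.file_size for i in infos) > MAX_TOTAL_BYTES:
            problems.append("uncompressed size too large")
        for info in infos:
            if unsafe_name(info.filename):
                problems.append(f"unsafe path: {info.filename!r}")
            if is_symlink(info):
                problems.append(f"symlink: {info.filename!r}")
            if info.compress_size and info.file_size / info.compress_size > MAX_RATIO:
                problems.append(f"compression ratio too high: {info.filename!r}")
    return problems
```

Sizes come from the archive's own directory records, so this is a pre-check; enforcing the limits again while extracting is still advisable.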
Reference Materials & Scripts
Reference Documents (reference/)
| File | Read When | Contains |
|---|---|---|
| reference/hwpx-format.md | Before any direct XML edit | OWPML ZIP structure, namespaces, file layout, mimetype |
| reference/header-xml-guide.md | Adding/modifying charPr/paraPr/borderFill styles | How to add new styles to header.xml |
| reference/section0-xml-guide.md | Paragraph/table/mixed-formatting XML | XML template for section0.xml bodies |
| reference/style_id_maps.md | Style ID lookup for template overlay | Complete style ID index for all templates |
| reference/dependencies.md | First-time setup / environment check | Python/system packages needed |
| reference/visual_qa_prompt.md | Visual QA via subagent | Ready-to-use prompt for PDF-image inspection |
| reference/table_templates/*.xml | Inserting pre-built tables | 2x6, 3x3, 4x4, 5x4 grid XML fragments |
Python Scripts (scripts/)
| Script | Purpose | Command |
|---|---|---|
| hwpx_cli.py | Unified Python CLI (14+ commands) | python3 scripts/hwpx_cli.py {command} ... |
| build_hwpx.py | Template-based creation | python3 scripts/build_hwpx.py --template {type} |
| analyze_template.py | Inspect template structure | python3 scripts/analyze_template.py work/ |
| create_document.py | Create empty or custom HWPX | python3 scripts/create_document.py OUT.hwpx |
| table_builder.py | Build table XML from Python objects | Used internally |
| page_guard.py | Detect paragraph/table/text drift | python3 scripts/page_guard.py -r ref.hwpx -o out.hwpx |
| contact_sheet.py | QA contact sheet (page grid image) | python3 scripts/contact_sheet.py INPUT.pdf sheet.png |
| validate.py | 9-level structural validation | python3 scripts/validate.py INPUT.hwpx |
| hwp_reader.py | Read HWP 5.0 binary (read-only) | python3 scripts/hwp_reader.py INPUT.hwp |
| hwp_convert.py | HWP → HWPX conversion | python3 scripts/hwp_convert.py IN.hwp OUT.hwpx |
| text_extract.py | Extract plain text from HWPX | python3 scripts/text_extract.py INPUT.hwpx |
Pattern-Match Editing (L4 Fallback)
For complex form editing beyond officecli set/find-replace — KICE exams, multi-section regulations, fragmented text nodes.
Core Flow
# Unpack HWPX
python3 scripts/hwpx_cli.py open input.hwpx
# Edit XML files in work/Contents/
# (lineseg is stripped automatically by hwpx_cli.py)
# Repack
python3 scripts/hwpx_cli.py save output.hwpx
Key Patterns
| Pattern | Description |
|---|---|
| Lineseg strip | Remove stale <hp:linesegarray> cache on every direct XML write |
| Checkbox substitution | ☐ → ☑, with multi-<t> node handling |
| Uniform-space normalization | "홍 길 동" ↔ "홍길동" conversion |
| p[0] Monster | secPr + tbl + question 1 text merged in first paragraph |
| Equation interleaving | <t> ↔ <equation> alternating — skip equations during text extraction |
Lineseg Strip (Critical)
When editing HWPX XML directly (NOT via officecli or scripts/hwpx_cli.py), you MUST strip ALL <hp:linesegarray> elements. Stale layout cache causes characters to overlap into a single line.
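import re
# Strip self-closing elements first so the paired pattern below cannot start on one
xml = re.sub(r'<(?:hp:)?linesegarray\b[^>]*/>', '', xml)
# Then strip paired elements together with their cached lineseg children
xml = re.sub(r'<(?:hp:)?linesegarray\b[^>]*>.*?</(?:hp:)?linesegarray>', '', xml, flags=re.DOTALL)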
officecli and scripts/hwpx_cli.py handle lineseg stripping automatically. This rule applies only to raw Python XML editing.
Common Pitfalls
| Pitfall | Correct Approach |
|---|---|
| --props text=Hello | --prop text=Hello (singular --prop, always) |
| /body/p[1] path | HWPX uses /section[1]/p[1]: section-based, not body |
| Unquoted [N] in shell | "/section[1]/p[1]": always quote paths |
| fontsize omitted | --prop fontsize=11 always; prevents charPr 0 pollution |
| officecli view file.hwpx (no mode) | Error. Must specify: text, markdown, tables, etc. |
| Manual table mapping | view tables replaces manual inspection |
| Recreating header.xml styles from template | cp source.hwpx target.hwpx first. Read reference/style_id_maps.md |
| officecli open as background shell | Run in the foreground: it returns immediately and the daemon runs in the background automatically |
| Direct XML edit without lineseg strip | Stale cache causes text overlap. Use hwpx_cli.py or strip manually |
| Custom style work without reading reference/ | reference/header-xml-guide.md + reference/style_id_maps.md are mandatory |
Anti-Patterns (Must Avoid)
- No plain-text math in exams: KICE docs require <hp:equation> elements, and output without them is broken
- No unguarded HWP binary overwrite: prefer --prop output=...; only use safe in-place when safeInPlace.ready=true
- No fake HWPX fallback when rhwp has a native primitive: if OfficeCLI lacks the route, report the gap and stop for approval
- No XML editing without lineseg strip: stale cache causes overlapping text
- No visible QA markers in fixed-layout exams: use screenshots or sidecar evidence instead
- No cross-format skill loading: this skill is .hwp/.hwpx only
- No rebuilding styles that exist in template: cp first and read reference/style_id_maps.md
- No ignoring reference materials: header-xml-guide.md, section0-xml-guide.md, and style_id_maps.md are mandatory for custom XML work
Dependencies
| Tool | Purpose | Required? |
|---|---|---|
| officecli (global) | Primary HWPX CLI + experimental rhwp-backed HWP bridge | Required |
| python3 | Fallback scripts (scripts/*.py) | Required for L3/L4 |
| lxml | XML processing for scripts/* | Required for L3/L4 |
| pyhwp | Legacy HWP 5.0 binary reading/conversion fallback | Required for HWP→HWPX fallback |
| soffice (LibreOffice) | PDF conversion + visual verification | Recommended |
| Java (JAVA_HOME) | H2Orestart HWP conversion engine | For HWP→HWPX only |
| dotnet | Build officecli from source | For builds only |
Prerequisite Check
# OfficeCLI (required)
which officecli >/dev/null 2>&1 || echo "WARN: OfficeCLI not installed"
# LibreOffice (recommended, auto-installs when needed)
which soffice >/dev/null 2>&1 || echo "INFO: LibreOffice not installed"
# Python packages (optional for L3/L4); the pyhwp package is imported as hwp5
python3 -c "import lxml; import hwp5" 2>/dev/null || echo "OPTIONAL: pip install lxml pyhwp"
# Java (for HWP conversion only)
echo "JAVA_HOME=$JAVA_HOME"
Tool Discovery
Always confirm syntax from help before guessing.
officecli --help
officecli hwp --json
officecli hwp doctor --json
officecli capabilities --json
officecli view --help
officecli set --help
python3 scripts/hwpx_cli.py --help
python3 scripts/build_hwpx.py --help
Exam XML Structure Patterns
KICE-style exam sheets require special handling due to fixed-layout constraints.
| Pattern | Description | Detection |
|---|---|---|
| Page/Column breaks | pageBreak="1" / columnBreak="1" | Page boundary = question group boundary |
| p[0] Monster | secPr + colPr + title tbl + question 1 text merged | Everything in first paragraph |
| Equation interleaving | <t> ↔ <equation> alternating | Skip equations during text extraction |
| Answer choices | ① + 5 <equation> (5-choice) | Auto-detect answer paragraphs |
| Text fragmentation | 1-2 char <t> splits (HWP conversion) | Concatenate all text then match |
| 2-column layout | <hp:colPr type="NEWSPAPER" colCount="2"> | Exam-specific layout |
Visible markers such as [CU TEMPLATE EDIT ...] or VISUAL QA inside the question body are visual hard failures. Capture before/after screenshots instead.
HWP → HWPX Conversion
Format Detection
file doc.hwpx # "Zip archive" -> HWPX (ZIP + OWPML XML)
file doc.hwp # "HWP Document" -> HWP 5.0 binary
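Where the file utility is unavailable, the same distinction can be made from the first bytes of the file. This is a coarse sketch: the ZIP signature only proves a ZIP container (DOCX and others share it) and the OLE compound-file signature is shared by other legacy Office formats, so treat the result as a hint. sniff_format and detect are hypothetical names.

```python
ZIP_MAGIC = b"PK\x03\x04"                         # local file header of a ZIP archive
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")     # OLE compound file, used by HWP 5.0


def sniff_format(head: bytes) -> str:
    """Classify a file from its leading bytes: 'hwpx', 'hwp', or 'unknown'."""
    if head.startswith(ZIP_MAGIC):
        return "hwpx"
    if head.startswith(OLE_MAGIC):
        return "hwp"
    return "unknown"


def detect(path: str) -> str:
    """Read the first 8 bytes of the file and classify them."""
    with open(path, "rb") as fh:
        return sniff_format(fh.read(8))


print(sniff_format(b"PK\x03\x04" + b"\x00" * 4))
print(sniff_format(OLE_MAGIC))
```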
Structural Differences
| Aspect | Native HWPX | HWP → HWPX Converted |
|---|---|---|
| Text unit | Short <t> per run | Entire paragraph in one <t> |
| Title p[0] | secPr + tbl + content | Page number fragments <t>20</t> + <t>1</t> mixed in |
| Editing | Run-level precise replacement | Raw string replace or whole-paragraph swap needed |
Editing Strategies for Converted Files
- Title: run-aware replacement with set_run_text(p0, 'old', 'new') (skip page-number runs)
- Body: raw string replace on serialized XML with sec0.replace(old, new)
- Multi-<t> cells: use ReplaceTextInCell(), which concatenates all <t> → match → redistribute
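The concatenate, match, redistribute idea for fragmented text can be sketched on plain strings, one string per <t> node. This is not the ReplaceTextInCell() implementation; replace_across_fragments is a hypothetical function that puts the replacement into the first fragment the match touches and empties the matched part of the others, so the node count stays the same.

```python
def replace_across_fragments(fragments: list[str], old: str, new: str) -> list[str]:
    """Replace the first occurrence of `old` in the joined text, keeping fragment count."""
    joined = "".join(fragments)
    start = joined.find(old)
    if not old or start < 0:
        return list(fragments)
    end = start + len(old)

    out, pos, placed = [], 0, False
    for frag in fragments:
        frag_start, frag_end = pos, pos + len(frag)
        pos = frag_end
        # Slices of this fragment that lie before and after the matched span.
        before = frag[: max(0, min(len(frag), start - frag_start))]
        after = frag[max(0, min(len(frag), end - frag_start)):]
        overlaps = frag_start < end and frag_end > start
        if overlaps and not placed:
            out.append(before + new + after)   # first touched fragment gets the new text
            placed = True
        elif overlaps:
            out.append(after)                  # later fragments keep only what follows
        else:
            out.append(frag)
    return out


print(replace_across_fragments(["홍", "길", "동 귀하"], "홍길동", "김서준"))
```

With an XML parser, the fragments would be the text of each <t> element in a cell, written back in the same order after the call.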