Extract all text from a document
A recipe for pulling the plain text out of every story in reading order — walk StoryList, then each story tree, concatenating Content.
To extract a document's plain text, walk its stories in StoryList order and concatenate the Content of every range.
In short: Getting the plain text out of an IDML document means visiting its
stories in reading order and concatenating the text of every Content element. You
take the order from StoryList on the Document element, then walk each story tree
Story → ParagraphStyleRange → CharacterStyleRange → Content, treating Br and
Tab as whitespace. No styling needs to be resolved — extraction only cares about
which characters appear and where the boundaries are. This page gives the recipe and
the cases to get right.
To get the plain text of a document you walk its stories and concatenate the
Content of every range, in document order. You do not need to resolve any
styling — text extraction only cares about which characters appear and where the
boundaries are.
The recipe
- Open the design map and read StoryList on the Document element. It is a space-separated list of story ids, and it gives you the stories in document order.
- For each id, open its story part (Stories/Story_<id>.xml). One file holds one Story element.
- Walk the story tree in order: Story → ParagraphStyleRange → CharacterStyleRange → Content. Append the text of each Content element as you reach it.
- Treat the markers as whitespace. A Br is a line break; emit \n. A Tab is a tab; emit \t. Our parser already folds these into the run text this way, so a CharacterStyleRange's text comes out with the breaks in place.
- End each paragraph with a newline. Each ParagraphStyleRange is one paragraph; put a \n between consecutive paragraphs so the output reads as paragraphs, not one run-on line.
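The first two steps can be sketched in Python as follows. This is a minimal sketch, not our parser: it assumes the design map is the package-root part named designmap.xml with Document as its root element, and the names story_parts_in_order and demo_package are hypothetical helpers introduced here for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def story_parts_in_order(package: zipfile.ZipFile) -> list[tuple[str, bytes]]:
    """Return (story id, raw story XML) pairs in StoryList order."""
    # Assumption: the design map lives at the package root as designmap.xml
    # and its root element is Document.
    root = ET.fromstring(package.read("designmap.xml"))
    story_ids = root.get("StoryList", "").split()
    # Follow the listed order; never sort by id or file name.
    return [(sid, package.read(f"Stories/Story_{sid}.xml")) for sid in story_ids]


def demo_package() -> zipfile.ZipFile:
    """Build a tiny in-memory stand-in for an IDML package (illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("designmap.xml", '<Document StoryList="ub ua"/>')
        zf.writestr("Stories/Story_ua.xml", "<Story Self='ua'/>")
        zf.writestr("Stories/Story_ub.xml", "<Story Self='ub'/>")
    buffer.seek(0)
    return zipfile.ZipFile(buffer)


order = [sid for sid, _ in story_parts_in_order(demo_package())]
print(order)  # ['ub', 'ua']
```

The demo lists ub before ua on purpose: the result follows StoryList, not alphabetical order. A story id that has no matching part raises KeyError from package.read, which a real extractor would probably want to handle.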
Worked over a two-paragraph story
The story below has two paragraphs with one run each. Walking it in order yields:
The first paragraph stands on its own.
The second paragraph follows it.
That is the first paragraph's Content, then a paragraph break, then the second paragraph's Content. The SpaceBefore on the second range is a layout attribute, so extraction ignores it.
One story, two ParagraphStyleRange blocks: the unit your loop iterates over.
Stories/Story_ustory.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<idPkg:Story xmlns:idPkg="http://ns.adobe.com/AdobeInDesign/idml/1.0/packaging" DOMVersion="20.0">
<Story Self="ustory">
<ParagraphStyleRange AppliedParagraphStyle="ParagraphStyle/$ID/[No paragraph style]">
<CharacterStyleRange AppliedCharacterStyle="CharacterStyle/$ID/[No character style]">
<Content>The first paragraph stands on its own.</Content>
</CharacterStyleRange>
</ParagraphStyleRange>
<ParagraphStyleRange AppliedParagraphStyle="ParagraphStyle/$ID/[No paragraph style]" SpaceBefore="6">
<CharacterStyleRange AppliedCharacterStyle="CharacterStyle/$ID/[No character style]">
<Content>The second paragraph follows it.</Content>
</CharacterStyleRange>
</ParagraphStyleRange>
</Story>
</idPkg:Story>
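Walking this story can be sketched with Python's standard ElementTree. The sketch below is illustrative only: story_text is a hypothetical name, the sample is repeated inline without its XML declaration, and the loop only visits paragraphs that are direct children of the inner Story element, so text nested in tables is not reached.

```python
import xml.etree.ElementTree as ET

STORY_XML = """
<idPkg:Story xmlns:idPkg="http://ns.adobe.com/AdobeInDesign/idml/1.0/packaging" DOMVersion="20.0">
<Story Self="ustory">
<ParagraphStyleRange AppliedParagraphStyle="ParagraphStyle/$ID/[No paragraph style]">
<CharacterStyleRange AppliedCharacterStyle="CharacterStyle/$ID/[No character style]">
<Content>The first paragraph stands on its own.</Content>
</CharacterStyleRange>
</ParagraphStyleRange>
<ParagraphStyleRange AppliedParagraphStyle="ParagraphStyle/$ID/[No paragraph style]" SpaceBefore="6">
<CharacterStyleRange AppliedCharacterStyle="CharacterStyle/$ID/[No character style]">
<Content>The second paragraph follows it.</Content>
</CharacterStyleRange>
</ParagraphStyleRange>
</Story>
</idPkg:Story>
"""


def story_text(story_xml: str) -> str:
    """Concatenate Content per paragraph, treating Br and Tab as whitespace."""
    root = ET.fromstring(story_xml)  # root is the idPkg:Story wrapper
    paragraphs = []
    # The inner Story element and its descendants carry no namespace.
    for para in root.findall("Story/ParagraphStyleRange"):
        parts = []
        for run in para.findall("CharacterStyleRange"):
            for node in run:
                if node.tag == "Content":
                    parts.append(node.text or "")
                elif node.tag == "Br":
                    parts.append("\n")
                elif node.tag == "Tab":
                    parts.append("\t")
        paragraphs.append("".join(parts))
    # One newline between consecutive paragraphs; attributes such as
    # SpaceBefore are never read.
    return "\n".join(paragraphs)


print(story_text(STORY_XML))
```

Only element names and Content text are inspected, which is why no style lookup is involved.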
Things to get right
- Order comes from StoryList, then document order within each story. Don't sort by story id or file name; the listed order is the reading order.
- Some text is metadata, not body copy. Our parser drops HiddenText, Note, and Index/IndexEntry subtrees from the flowed text. If you walk the raw XML yourself, skip those subtrees too, or your output will contain text the reader never sees.
- A TextVariableInstance contributes its ResultText. Where a story carries a running header, page number, or file-name variable, the visible characters are the frozen ResultText (see text variables).
- Threaded stories are still one story. Text that flows across several frames is a single Story part; you extract it once, regardless of how many frames display it. See threading and overset.
- Tables hold their text in cells. A Table nested in a run carries its own Content inside Cell elements; include them if you want the full copy.
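If you walk the raw XML yourself, the skipping and substitution rules above might look like the recursive sketch below. It is an assumption-laden illustration rather than our parser's code: visible_text and SKIPPED are hypothetical names, ResultText is assumed to be an attribute on TextVariableInstance, and paragraph separators are left out to keep the focus on which subtrees contribute characters.

```python
import xml.etree.ElementTree as ET

# Subtrees treated as metadata rather than body copy (per the list above).
SKIPPED = {"HiddenText", "Note", "Index", "IndexEntry"}


def visible_text(node: ET.Element) -> str:
    """Collect the visible characters below node, in document order."""
    if node.tag in SKIPPED:
        return ""  # drop the whole subtree, including any Content inside it
    if node.tag == "Content":
        return node.text or ""
    if node.tag == "Br":
        return "\n"
    if node.tag == "Tab":
        return "\t"
    if node.tag == "TextVariableInstance":
        # Assumption: the frozen snapshot is stored in the ResultText attribute.
        return node.get("ResultText", "")
    # Any other element (ranges, Table, Cell, ...) contributes its children,
    # so Content nested in table cells is included.
    return "".join(visible_text(child) for child in node)


sample = ET.fromstring(
    "<Story><ParagraphStyleRange><CharacterStyleRange>"
    "<Content>Seen</Content>"
    "<Note><ParagraphStyleRange><CharacterStyleRange>"
    "<Content>secret</Content>"
    "</CharacterStyleRange></ParagraphStyleRange></Note>"
    "<Tab/><TextVariableInstance ResultText='p. 3'/><Br/>"
    "<Table><Cell><ParagraphStyleRange><CharacterStyleRange>"
    "<Content>cell</Content>"
    "</CharacterStyleRange></ParagraphStyleRange></Cell></Table>"
    "</CharacterStyleRange></ParagraphStyleRange></Story>"
)
print(repr(visible_text(sample)))  # 'Seen\tp. 3\ncell'
```

Recursing over every element instead of hard-coding the four-level path is what lets the same function reach table cells. To leave table copy out, add Table to SKIPPED.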
Frequently asked questions
What order should I extract stories in?
The order given by StoryList on the Document element — a space-separated list of
story ids that is the reading order. Don't sort by story id or file name; within each
story, follow document order through the tree.
Do I need to resolve styles to extract text?
No. Extraction only cares about which characters appear and where the boundaries are,
so you can walk Story → ParagraphStyleRange → CharacterStyleRange → Content
and concatenate the text without touching Styles.xml.
How do I handle threaded stories that span several frames?
Extract them once. Text that flows across several frames is still a single Story
part, so the number of frames that display it doesn't change the extraction — see
threading and overset.
Why does my extracted text contain copy the reader never sees?
Probably because you walked the raw XML without skipping metadata subtrees. Our parser
drops HiddenText, Note, and Index / IndexEntry from the flowed text; if you
walk the XML yourself, skip those subtrees too, or they'll end up in your output.
Text variables
How TextVariableInstance and auto-page-number markers become characters in a run — frozen ResultText snapshots and private-use placeholder markers.
Why ranges, not spans
Why IDML models styled text as nested paragraph and character ranges rather than the inline spans of HTML, and what that means for walking the tree.