[xep-support] Indexing again

Nikolai Grigoriev grig at renderx.com
Tue Jan 15 16:10:48 PST 2002


Gustaf,

your plan for getting indexes is good. I'll just add some comments here and there :-)

> 1. The XML file (the book) contains words/phrases marked up with <index>
> tags. The stylesheet have a page-sequence for the index, where index items
> are collected (word + page). There are no particular styling on these pages
> since they will be replaced later. And there needs to be no sorting at
> this stage.

Hints:

1. To facilitate further parsing, start each item (index entry or
page number) on a separate line, i.e. in its own block. This guarantees
that each item produces a separate <text> element in the output XML.

2. Invent some easily recognizable header and footer for your 
index sequence.

3. Ordering keywords and making a list of unique values can
easily be done in an XSLT stylesheet. It's a common grouping
problem, usually solved using the Muenchian method.

So, you get a piece of code in your stylesheet that looks like this:

<!-- A useful thing - all index entries are keyed by value -->
<xsl:key name="index-entries" match="index" use="."/>

<fo:page-sequence master-reference="index-page">
    <fo:flow flow-name="xsl-region-body">
        <fo:block>INDEX_STARTS_HERE</fo:block>

        <!-- External loop: choose unique index values -->
        <xsl:for-each
                select="//index[generate-id() =
                               generate-id(key('index-entries', .))]">
            <xsl:sort order="ascending"/>

            <!-- Print the keyword -->
            <fo:block><xsl:value-of select="."/></fo:block>

            <!-- Internal loop: iterate over all occurrences of one keyword -->
            <xsl:for-each select="key('index-entries', .)">
                <fo:block>
                    <xsl:text>#</xsl:text>
                    <fo:page-number-citation ref-id="{generate-id()}"/>
                </fo:block>
            </xsl:for-each>

        </xsl:for-each>

        <fo:block>INDEX_ENDS_HERE</fo:block>
    </fo:flow>
</fo:page-sequence>
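
The page-number citations above only resolve if each index term in the
body of the book carries an id equal to the same generate-id() value.
A minimal sketch of the matching template, assuming the terms are
rendered inline (fo:inline is my choice here, not something the
original stylesheet is known to use); the template is meant to be merged
into the book stylesheet, and is wrapped in xsl:stylesheet only to keep
the fragment well-formed:

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <!-- Assumption: <index> elements appear inline in running text.
       generate-id() returns the same value for the same node throughout
       one transformation, so this id matches the ref-id emitted in the
       index page-sequence. -->
  <xsl:template match="index">
    <fo:inline id="{generate-id()}">
      <xsl:apply-templates/>
    </fo:inline>
  </xsl:template>

</xsl:stylesheet>
```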

> 2. Generate the FO file. Now the index pages contains unique identifiers
> (not page numbers as I once thought, before I thought):
>
> <fo:page-number-citation ref-id="d0e1076"/>

Yes. You get something like this:

<fo:page-sequence master-reference="index-page">
  <fo:flow flow-name="xsl-region-body">
    <fo:block>INDEX_STARTS_HERE</fo:block>

    <fo:block>bar</fo:block>
        <fo:block>#<fo:page-number-citation ref-id="h3j4k5"/></fo:block>
        <fo:block>#<fo:page-number-citation ref-id="y6u4o3"/></fo:block>
    <fo:block>foo</fo:block>
        <fo:block>#<fo:page-number-citation ref-id="l6b3m4"/></fo:block>
        <fo:block>#<fo:page-number-citation ref-id="j6kdk4"/></fo:block>
        <fo:block>#<fo:page-number-citation ref-id="t1z2x3"/></fo:block>
    <fo:block>INDEX_ENDS_HERE</fo:block>
  </fo:flow>
</fo:page-sequence>


> 3. Run XEP in PDF mode to get the book visually, except the finished index
> pages.
> 4. Run XEP in XML mode to get the words + page numbers in readable XML
> markup.
> 5. Write a SAX filter to sort out the index data, and collect it in a new
> XML file (called the "index file").

I'd rather postpone 3 until the very end. First things to do:
- run XEP in XML mode;
- run another XSLT stylesheet to extract just the index data.

In the XML representation, you should get something like this
(omitting position/dimension attributes):

<page>
  <text value="INDEX_STARTS_HERE"/>
  <text value="bar"/>
    <text value="#43"/>
    <text value="#115"/>
  <text value="foo"/>
    <text value="#18"/>
</page>
<page>
    <text value="#20"/>
    <text value="#20"/>
  <text value="INDEX_ENDS_HERE"/>
</page>

It is easy to extract index data from this form with a short XSLT script.
I even believe that you can merge this phase with the following one -
just write a stylesheet that operates on the XEP XML output. I don't
think there is a place for a low-level SAX filter here - XSLT will do it
in 30-50 lines of code.
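
A rough sketch of such an extraction stylesheet, under several
assumptions: the element and attribute names (text, value) are exactly
those of the simplified sample above, with no namespace; nothing else
(headers, footers, folios) is printed between the two markers; and the
output format (index/entry/page) is made up for illustration. Adapt the
names to what your XEP version really emits.

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Every <text> strictly between the two marker lines, across pages -->
  <xsl:variable name="items"
      select="//text[preceding::text[@value = 'INDEX_STARTS_HERE']]
                    [following::text[@value = 'INDEX_ENDS_HERE']]"/>

  <!-- Page references ("#43") keyed by the id of the nearest preceding
       line that is not a page reference, i.e. their keyword -->
  <xsl:key name="refs-by-term"
      match="text[starts-with(@value, '#')]"
      use="generate-id(preceding::text[not(starts-with(@value, '#'))][1])"/>

  <xsl:template match="/">
    <index>
      <!-- Keyword lines are the items that do not start with '#' -->
      <xsl:for-each select="$items[not(starts-with(@value, '#'))]">
        <entry term="{@value}">
          <xsl:for-each select="key('refs-by-term', generate-id())">
            <page><xsl:value-of select="substring-after(@value, '#')"/></page>
          </xsl:for-each>
        </entry>
      </xsl:for-each>
    </index>
  </xsl:template>

</xsl:stylesheet>
```

Duplicate page numbers are deliberately kept at this stage; they can be
removed when the index is formatted.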

> 6. Write an FO stylesheet to generate the index pages from the index file.
> Styling and sorting is added here.

Note that you only have to rearrange page numbers here - remove duplicates
and possibly merge adjacent pages into ranges. Seems quite a feasible task.
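
One way this could look in XSLT 1.0, assuming a hypothetical index file
of the form <index><entry term="foo"><page>18</page>...</entry></index>
(that format is my invention, not something XEP produces). A page opens
a range if it is the first occurrence of its number and the previous
number is absent; a recursive named template then finds where the range
ends.

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <xsl:template match="/index">
    <fo:page-sequence master-reference="index-page">
      <fo:flow flow-name="xsl-region-body">
        <xsl:apply-templates select="entry">
          <xsl:sort select="@term"/>
        </xsl:apply-templates>
      </fo:flow>
    </fo:page-sequence>
  </xsl:template>

  <xsl:template match="entry">
    <fo:block>
      <xsl:value-of select="@term"/>
      <xsl:text>  </xsl:text>
      <!-- Range starts: first occurrence of a number whose
           predecessor (number - 1) is not listed for this entry -->
      <xsl:for-each select="page[not(. = preceding-sibling::page)]
                                [not(../page = . - 1)]">
        <xsl:sort select="." data-type="number"/>
        <xsl:if test="position() &gt; 1">, </xsl:if>
        <xsl:value-of select="."/>
        <xsl:call-template name="range-end">
          <xsl:with-param name="first" select="number(.)"/>
          <xsl:with-param name="last" select="number(.)"/>
        </xsl:call-template>
      </xsl:for-each>
    </fo:block>
  </xsl:template>

  <!-- Extends the range while the next page number is present, then
       prints "-last" if the range covers more than one page -->
  <xsl:template name="range-end">
    <xsl:param name="first"/>
    <xsl:param name="last"/>
    <xsl:choose>
      <xsl:when test="../page = $last + 1">
        <xsl:call-template name="range-end">
          <xsl:with-param name="first" select="$first"/>
          <xsl:with-param name="last" select="$last + 1"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:when test="$last &gt; $first">
        <xsl:text>-</xsl:text>
        <xsl:value-of select="$last"/>
      </xsl:when>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

For an entry with pages 18, 20, 20 this prints "18, 20"; with pages
18, 19, 20 it prints "18-20".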

> 7. Replace the index pages in the PDF.

Another option: append the newly generated FO fragment to the original
document, and format the whole thing in a single run. This guarantees
contiguous page numbering.

Moreover, you can write a stylesheet to do steps 5, 6, and 7 in a single
run (:-)). Such a stylesheet would take the original FO file as input; it
would replace the fo:page-sequence for the index (from which you have
generated the XML format) with a new styled version, and copy all the rest.
The data about index entries can be loaded via the document() function.
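
A sketch of that replacement stylesheet, built on the usual identity
transform. The file name index.xml and its index/entry/page layout are
hypothetical, as is the styling; the placeholder sequence is recognized
by master-reference="index-page", as in the example earlier in this
message.

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <!-- Hypothetical index file; a relative URI given as a string is
       resolved against the location of this stylesheet -->
  <xsl:variable name="index" select="document('index.xml')/index"/>

  <!-- Identity transform: copy everything that is not overridden below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Replace the placeholder index sequence with a styled one -->
  <xsl:template match="fo:page-sequence[@master-reference = 'index-page']">
    <fo:page-sequence master-reference="index-page">
      <fo:flow flow-name="xsl-region-body">
        <fo:block font-weight="bold" space-after="12pt">Index</fo:block>
        <xsl:for-each select="$index/entry">
          <xsl:sort select="@term"/>
          <fo:block>
            <xsl:value-of select="@term"/>
            <xsl:text>  </xsl:text>
            <!-- Skip repeated page numbers within one entry -->
            <xsl:for-each select="page[not(. = preceding-sibling::page)]">
              <xsl:if test="position() &gt; 1">, </xsl:if>
              <xsl:value-of select="."/>
            </xsl:for-each>
          </fo:block>
        </xsl:for-each>
      </fo:flow>
    </fo:page-sequence>
  </xsl:template>

</xsl:stylesheet>
```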

Hope this helps. It looks like my message turned out rather clumsy; feel free
to ask for clarification.

Regards,
Nikolai

