org.pdfbox.util
Class PDFText2HTML

java.lang.Object
  extended by org.pdfbox.util.PDFStreamEngine
      extended by org.pdfbox.util.PDFTextStripper
          extended by org.pdfbox.util.PDFText2HTML

public class PDFText2HTML
extends PDFTextStripper

Wraps stripped text in simple HTML and tries to form HTML paragraphs. Paragraphs split across pages, columns, or figures are not rejoined.
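
The paragraph-forming idea can be sketched without the library. The class below is a hypothetical, stdlib-only illustration, not the PDFText2HTML implementation: it assumes that a blank line marks a paragraph break, and the names ParagraphWrapSketch and wrap are invented for this example.

```java
import java.util.Arrays;
import java.util.List;

public class ParagraphWrapSketch {

    /**
     * Wraps runs of non-blank lines in p elements.
     * Assumption: a blank line ends the current paragraph.
     */
    public static String wrap(List<String> lines) {
        StringBuilder html = new StringBuilder();
        boolean inParagraph = false;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                // A blank line closes the open paragraph, if any.
                if (inParagraph) {
                    html.append("</p>\n");
                    inParagraph = false;
                }
                continue;
            }
            if (inParagraph) {
                // Join continuation lines with a single space.
                html.append(' ');
            } else {
                html.append("<p>");
                inParagraph = true;
            }
            html.append(trimmed);
        }
        if (inParagraph) {
            html.append("</p>\n");
        }
        return html.toString();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "First line,", "continued.", "", "Second paragraph.");
        System.out.print(wrap(lines));
    }
}
```

A paragraph interrupted by a page break would arrive here as two separate runs of lines and would therefore be emitted as two paragraphs, which matches the limitation stated above.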

Version:
$Revision: 1.1 $
Author:
jjb - http://www.johnjbarton.com

Nested Class Summary
 
Nested classes/interfaces inherited from class org.pdfbox.util.PDFStreamEngine
PDFStreamEngine.StreamResources
 
Field Summary
 
Fields inherited from class org.pdfbox.util.PDFTextStripper
charactersByArticle, output
 
Fields inherited from class org.pdfbox.util.PDFStreamEngine
fontToAverageWidths, graphicsStack, operators, page, SPACE_BYTES, streamResourcesStack, textLineMatrix, textMatrix
 
Constructor Summary
PDFText2HTML()
          Creates a new PDFText2HTML instance.
 
Method Summary
 void endDocument(PDDocument pdf)
          This method is available for subclasses of this class.
protected  void endParagraph()
          Write out the paragraph separator.
protected  void flushText()
          This will print the text to the output stream.
protected  String getTitleGuess()
          Returns the guessed document title.
protected  TextPosition guessTitle(Iterator textIter)
          This method will attempt to guess the title of the document.
 boolean isSuppressParagraphs()
          Returns whether paragraphs are suppressed.
 void setSuppressParagraphs(boolean shouldSuppressParagraphs)
          Sets whether paragraphs are suppressed.
protected  void startParagraph()
          Write out the start of a paragraph.
protected  void writeCharacters(TextPosition position)
          Write the string to the output stream.
protected  void writeHeader()
          Write the header to the output document.
 
Methods inherited from class org.pdfbox.util.PDFTextStripper
endPage, getCharactersByArticle, getCurrentPageNo, getEndBookmark, getEndPage, getLineSeparator, getOutput, getPageSeparator, getStartBookmark, getStartPage, getText, getText, getWordSeparator, processPage, processPages, setEndBookmark, setEndPage, setLineSeparator, setPageSeparator, setShouldSeparateByBeads, setSortByPosition, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, shouldSeparateByBeads, shouldSortByPosition, shouldSuppressDuplicateOverlappingText, showCharacter, startDocument, startPage, writeText, writeText
 
Methods inherited from class org.pdfbox.util.PDFStreamEngine
getColorSpaces, getCurrentPage, getFonts, getGraphicsStack, getGraphicsState, getGraphicsStates, getResources, getTextLineMatrix, getTextMatrix, getXObjects, processOperator, processOperator, processStream, processSubStream, setColorSpaces, setFonts, setGraphicsStack, setGraphicsState, setGraphicsStates, setTextLineMatrix, setTextMatrix, showString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

PDFText2HTML

public PDFText2HTML()
             throws IOException
Creates a new PDFText2HTML instance.

Throws:
IOException - If there is an error during initialization.

Method Detail

writeHeader

protected void writeHeader()
                    throws IOException
Write the header to the output document.

Throws:
IOException - If there is a problem writing out the header to the document.
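
The markup written by writeHeader is not specified on this page. As a rough illustration only, the following stdlib-only sketch shows one plausible shape for a header and the matching closing markup. The class HtmlHeaderSketch, its methods, and the exact tags are assumptions made for this example.

```java
public class HtmlHeaderSketch {

    /** Builds an opening HTML fragment around a title (assumed markup). */
    public static String header(String title) {
        return "<html>\n<head>\n<title>" + title + "</title>\n</head>\n<body>\n";
    }

    /** Builds the closing fragment that balances header(). */
    public static String footer() {
        return "</body>\n</html>\n";
    }

    public static void main(String[] args) {
        // The body text would be written between the two fragments.
        System.out.print(header("Annual Report"));
        System.out.print("<p>Body text.</p>\n");
        System.out.print(footer());
    }
}
```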

getTitleGuess

protected String getTitleGuess()
Returns the guessed document title.

Returns:
A string holding the guessed title of this document.

flushText

protected void flushText()
                  throws IOException
Description copied from class: PDFTextStripper
This will print the text to the output stream.

Overrides:
flushText in class PDFTextStripper
Throws:
IOException - If there is an error writing the text.
See Also:
PDFTextStripper.flushText()

endDocument

public void endDocument(PDDocument pdf)
                 throws IOException
Description copied from class: PDFTextStripper
This method is available for subclasses of this class. It will be called after processing of the document finishes.

Overrides:
endDocument in class PDFTextStripper
Parameters:
pdf - The PDF document that is being processed.
Throws:
IOException - If an IO error occurs.
See Also:
PDFTextStripper.endDocument( PDDocument )

guessTitle

protected TextPosition guessTitle(Iterator textIter)
This method will attempt to guess the title of the document.

Parameters:
textIter - The characters on the first page.
Returns:
The text position that is guessed to be the title.
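
This page does not state which heuristic guessTitle applies. One simple possibility, shown here purely as an assumption, is to pick the first text run with the largest font size. The Run class is a hypothetical stand-in for TextPosition, and TitleGuessSketch is not part of the library.

```java
import java.util.Arrays;
import java.util.Iterator;

public class TitleGuessSketch {

    /** Hypothetical stand-in for TextPosition: a text run and its font size. */
    public static final class Run {
        public final String text;
        public final float fontSize;

        public Run(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /**
     * Returns the first run with the largest font size,
     * or null when the iterator is empty.
     */
    public static Run guessTitle(Iterator<Run> runs) {
        Run best = null;
        while (runs.hasNext()) {
            Run candidate = runs.next();
            // Strict comparison keeps the earliest run on ties.
            if (best == null || candidate.fontSize > best.fontSize) {
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Iterator<Run> firstPage = Arrays.asList(
                new Run("Technical Report 42", 9f),
                new Run("A Study of Text Extraction", 18f),
                new Run("Introduction", 12f)).iterator();
        System.out.println(guessTitle(firstPage).text);
    }
}
```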

startParagraph

protected void startParagraph()
                       throws IOException
Write out the start of a paragraph.

Overrides:
startParagraph in class PDFTextStripper
Throws:
IOException - If there is an error writing to the stream.

endParagraph

protected void endParagraph()
                     throws IOException
Write out the paragraph separator.

Overrides:
endParagraph in class PDFTextStripper
Throws:
IOException - If there is an error writing to the stream.

writeCharacters

protected void writeCharacters(TextPosition position)
                        throws IOException
Description copied from class: PDFTextStripper
Write the string to the output stream.

Overrides:
writeCharacters in class PDFTextStripper
Parameters:
position - The text to write to the stream.
Throws:
IOException - If there is an error when writing the text.
See Also:
PDFTextStripper.writeCharacters( TextPosition )
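
Text extracted from a PDF can contain characters such as <, > and &, which have special meaning in HTML. This page does not say whether writeCharacters escapes them. The following stdlib-only helper is a sketch of the escaping a caller or subclass might apply; HtmlEscapeSketch and escape are hypothetical names, not library API.

```java
public class HtmlEscapeSketch {

    /** Replaces the characters that are significant in HTML with entities. */
    public static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':
                    out.append("&amp;");
                    break;
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                case '"':
                    out.append("&quot;");
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("if a < b && b > c"));
    }
}
```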

isSuppressParagraphs

public boolean isSuppressParagraphs()
Returns:
true if paragraphs are suppressed, false otherwise.

setSuppressParagraphs

public void setSuppressParagraphs(boolean shouldSuppressParagraphs)
Parameters:
shouldSuppressParagraphs - true to suppress paragraphs, false otherwise.
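
The effect of the suppressParagraphs flag is not described on this page. The sketch below assumes, based only on the method names, that the paragraph hooks write nothing while the flag is set. SuppressParagraphsSketch is a hypothetical, stdlib-only model and does not extend PDFTextStripper.

```java
public class SuppressParagraphsSketch {

    private final StringBuilder output = new StringBuilder();
    private boolean suppressParagraphs = false;

    public boolean isSuppressParagraphs() {
        return suppressParagraphs;
    }

    public void setSuppressParagraphs(boolean shouldSuppressParagraphs) {
        this.suppressParagraphs = shouldSuppressParagraphs;
    }

    // Assumed behaviour: the paragraph tags are skipped while the flag is set.
    public void startParagraph() {
        if (!suppressParagraphs) {
            output.append("<p>");
        }
    }

    public void endParagraph() {
        if (!suppressParagraphs) {
            output.append("</p>\n");
        }
    }

    public void writeText(String text) {
        output.append(text);
    }

    /** Renders one paragraph of text with the flag set as requested. */
    public static String render(String text, boolean suppress) {
        SuppressParagraphsSketch sketch = new SuppressParagraphsSketch();
        sketch.setSuppressParagraphs(suppress);
        sketch.startParagraph();
        sketch.writeText(text);
        sketch.endParagraph();
        return sketch.output.toString();
    }

    public static void main(String[] args) {
        System.out.print(render("With paragraph tags.", false));
        System.out.println(render("Without paragraph tags.", true));
    }
}
```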


Copyright © 2006-2007 EGIZ - E-Government Innovationszentrum. All Rights Reserved.