Package org.jpedal.examples.text
Class ExtractTextInRectangle
java.lang.Object
org.jpedal.examples.BaseExample
org.jpedal.examples.text.ExtractTextInRectangle
public class ExtractTextInRectangle
extends org.jpedal.examples.BaseExample
Extract text from PDF files.
This class provides a simple Java API for extracting text from a PDF file. It also provides static convenience methods for when you just want to dump all the text from a PDF file or from a directory containing PDF files.
See our Support Pages for more information on Text Extraction.
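As a rough illustration of the per-page methods listed in the Method Summary, the sketch below loops over pages using 1-based page numbers. So that it compiles without the JPedal jar, it uses a hypothetical `PageTextSource` interface that mirrors only `getPageCount()` and `getTextOnPage(int)`. That interface is an assumption made for illustration and is not part of JPedal. The documented `getTextOnPage` also declares `PdfException`, which the stand-in omits.

```java
import java.util.ArrayList;
import java.util.List;

public class PageLoopSketch {

    /**
     * Hypothetical stand-in (not a JPedal type) mirroring two methods
     * documented on ExtractTextInRectangle. The real getTextOnPage is
     * documented as throwing PdfException; that is left out here.
     */
    public interface PageTextSource {
        int getPageCount();
        String getTextOnPage(int page);
    }

    /** Collects the text of every page; the first page is page 1. */
    public static List<String> collectText(PageTextSource source) {
        List<String> pages = new ArrayList<>();
        final int pageCount = source.getPageCount();
        for (int page = 1; page <= pageCount; page++) {
            pages.add(source.getTextOnPage(page));
        }
        return pages;
    }

    public static void main(String[] args) {
        // Fake two-page document standing in for an opened PDF file.
        PageTextSource fake = new PageTextSource() {
            @Override public int getPageCount() { return 2; }
            @Override public String getTextOnPage(int page) { return "text of page " + page; }
        };
        collectText(fake).forEach(System.out::println);
    }
}
```

With the real class, the same loop would presumably run between the inherited `openPDFFile` and `closePDFfile` calls, whose signatures are not shown on this page.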
- 
Nested Class Summary
Nested Classes
- static enum ExtractTextInRectangle.OUTPUT_FORMAT - The available formats that text can be output as
- 
Constructor Summary
Constructors
- ExtractTextInRectangle(byte[] byteArray) - Sets up an ExtractTextInRectangle instance to open a PDF file contained as a BLOB within a byte[] stream
- ExtractTextInRectangle(byte[] byteArray, boolean extractPlainText) - Sets up an ExtractTextInRectangle instance to open a PDF file contained as a BLOB within a byte[] stream
- ExtractTextInRectangle(String fileName) - Sets up an ExtractTextInRectangle instance to open a PDF File
- ExtractTextInRectangle(String fileName, boolean extractPlainText) - Sets up an ExtractTextInRectangle instance to open a PDF File
- 
Method Summary
- void decodeFile(String file_name) - routine to decode a file
- int getPageCount() - number of pages in PDF file (starting at 1)
- String getTextOnPage(int page) - extract all text on page as a string value.
- String getTextOnPage(int page, int x1, int y1, int x2, int y2) - extract all text on page in a specified region as a string value. If the page contains text with multiple orientations (Left to right, bottom to top), only the most common orientation will be extracted and others will be ignored
- String getTextOnPage(int page, Rectangle rectangle) - extract all text on page in a specified region as a string value.
- static void main(String[] args) - This class will allow you to extract all text from page via command line from a single PDF file or a directory of PDF files.
- void setEstimateParagraphs(boolean estimateParagraphs)
- void setOutputFormat(ExtractTextInRectangle.OUTPUT_FORMAT format) - Sets which output format to use, XML or TXT
- void setPassword(String password)
- static void writeAllTextToDir(String inputDir, String outputDir, int maxPages) - Convenience method to write all the text in a directory of PDF files
- static void writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages) - Convenience method to write all the text in a directory of PDF files
- static void writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages, ExtractTextInRectangle.OUTPUT_FORMAT format, boolean estimateParagraphs) - Convenience method to write all the text in a directory of PDF files
- static void writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages, ExtractTextInRectangle.OUTPUT_FORMAT format, boolean estimateParagraphs, ErrorTracker errorTracker) - Convenience method to write all the text in a directory of PDF files
Methods inherited from class org.jpedal.examples.BaseExample
closePDFfile, openPDFFile
- 
Constructor Details
ExtractTextInRectangle
public ExtractTextInRectangle(String fileName)
Sets up an ExtractTextInRectangle instance to open a PDF File
- Parameters:
- fileName- full path to a single PDF file
 
- 
ExtractTextInRectangle
public ExtractTextInRectangle(String fileName, boolean extractPlainText)
Sets up an ExtractTextInRectangle instance to open a PDF File
- Parameters:
- fileName- full path to a single PDF file
- extractPlainText- flag to extract plain text rather than XML
 
- 
ExtractTextInRectangle
public ExtractTextInRectangle(byte[] byteArray)
Sets up an ExtractTextInRectangle instance to open a PDF file contained as a BLOB within a byte[] stream
- Parameters:
- byteArray- pdf file data
 
- 
ExtractTextInRectangle
public ExtractTextInRectangle(byte[] byteArray, boolean extractPlainText)
Sets up an ExtractTextInRectangle instance to open a PDF file contained as a BLOB within a byte[] stream
- Parameters:
- byteArray- pdf file data
- extractPlainText- flag to extract plain text rather than XML
 
 
- 
- 
Method Details
setOutputFormat
public void setOutputFormat(ExtractTextInRectangle.OUTPUT_FORMAT format)
Sets which output format to use, XML or TXT
- Parameters:
- format- the output format to use
 
- 
setEstimateParagraphs
public void setEstimateParagraphs(boolean estimateParagraphs)
Sets whether JPedal should estimate paragraph spacing in the output.
- 
decodeFile
public void decodeFile(String file_name) throws PdfException
Routine to decode a file
- Throws:
- PdfException
 
- 
getTextOnPage
public String getTextOnPage(int page) throws PdfException
Extract all text on page as a string value. If the page contains text with multiple orientations (Left to right, bottom to top), only the most common orientation will be extracted and others will be ignored.
- Parameters:
- page- page number (first page is 1)
- Returns:
- String with text
- Throws:
- PdfException- if problem with parsing and extracting text from PDF file
 
- 
getTextOnPage
public String getTextOnPage(int page, Rectangle rectangle) throws PdfException
Extract all text on page in a specified region as a string value. If the page contains text with multiple orientations (Left to right, bottom to top), only the most common orientation will be extracted and others will be ignored.
- Parameters:
- page- page number (first page is 1)
- rectangle- the region of the page to extract text from
- Returns:
- String with text
- Throws:
- PdfException- if problem with parsing and extracting text from PDF file
 
- 
getTextOnPage
public String getTextOnPage(int page, int x1, int y1, int x2, int y2) throws PdfException
Extract all text on page in a specified region as a string value. If the page contains text with multiple orientations (Left to right, bottom to top), only the most common orientation will be extracted and others will be ignored.
- Parameters:
- page- page number (first page is 1)
- x1- top left corner x
- y1- top left corner y
- x2- bottom right corner x
- y2- bottom right corner y
- Returns:
- String with text
- Throws:
- PdfException- if problem with parsing and extracting text from PDF file
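
The parameter list above asks for a top left and a bottom right corner, but this page does not say which coordinate system is used. As a minimal sketch, assuming PDF user-space coordinates (origin at the bottom left, y increasing upwards), the hypothetical helper below orders two arbitrary corners into x1, y1, x2, y2 before they are passed on. `RegionSketch` and `toCorners` are illustrative names, not part of JPedal.

```java
import java.util.Arrays;

public class RegionSketch {

    /**
     * Orders two opposite corners as {x1, y1, x2, y2}, i.e. top left
     * followed by bottom right. ASSUMPTION: y increases upwards, as in
     * PDF user space, so the top edge has the larger y value.
     */
    public static int[] toCorners(int xa, int ya, int xb, int yb) {
        int x1 = Math.min(xa, xb); // left edge
        int y1 = Math.max(ya, yb); // top edge (larger y under the assumption)
        int x2 = Math.max(xa, xb); // right edge
        int y2 = Math.min(ya, yb); // bottom edge
        return new int[] {x1, y1, x2, y2};
    }

    public static void main(String[] args) {
        // Corners supplied in the "wrong" order are rearranged.
        int[] c = toCorners(400, 100, 50, 700);
        System.out.println(Arrays.toString(c)); // prints [50, 700, 400, 100]
    }
}
```

If the library instead measures y downwards from the top of the page, the `Math.max` and `Math.min` calls for y would need to be swapped.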
 
- 
main
public static void main(String[] args)
Extracts all text from a single PDF file or a directory of PDF files via the command line.
The example expects two arguments:
- Value 1 is the file name or directory of PDF files to process
- Value 2 is the directory to write the data to
- Parameters:
- args- The expected arguments are described above.
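
A caller wrapping this entry point may want to check the two arguments first. The sketch below is one possible pre-check under stated assumptions: `MainArgsSketch` and `validate` are hypothetical names, and the rule that a missing output directory is acceptable is a guess, since this page does not say whether `main` creates it.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MainArgsSketch {

    /** Returns null when the arguments look usable, otherwise a message. */
    public static String validate(String[] args) {
        if (args == null || args.length != 2) {
            return "Expected 2 arguments: <PDF file or directory> <output directory>";
        }
        // Value 1: a single PDF file or a directory of PDF files.
        Path input = Paths.get(args[0]);
        if (!Files.exists(input)) {
            return "Input does not exist: " + input;
        }
        // Value 2: the output directory. ASSUMPTION: it may not exist yet,
        // but if it does exist it must be a directory.
        Path output = Paths.get(args[1]);
        if (Files.exists(output) && !Files.isDirectory(output)) {
            return "Output path is not a directory: " + output;
        }
        return null;
    }

    public static void main(String[] args) {
        String problem = validate(args);
        System.out.println(problem == null ? "Arguments look fine" : problem);
    }
}
```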
 
- 
writeAllTextToDir
public static void writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages, ExtractTextInRectangle.OUTPUT_FORMAT format, boolean estimateParagraphs) throws PdfException
Convenience method to write all the text in a directory of PDF files
- Parameters:
- inputDir- directory containing PDF files
- password- user or owner password for PDF files
- outputDir- directory for writing out wordlists
- maxPages- limit to just the first maxPages of a document
- format- set the output format for the text content (TXT or XML)
- estimateParagraphs- set if JPedal should estimate paragraph spacing in output.
- Throws:
- PdfException- if problem with parsing and extracting text from PDF file
 
- 
writeAllTextToDir
public static void writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages, ExtractTextInRectangle.OUTPUT_FORMAT format, boolean estimateParagraphs, ErrorTracker errorTracker) throws PdfException
Convenience method to write all the text in a directory of PDF files
- Parameters:
- inputDir- directory containing PDF files
- password- user or owner password for PDF files
- outputDir- directory for writing out wordlists
- maxPages- limit to just the first maxPages of a document
- format- set the output format for the text content (TXT or XML)
- estimateParagraphs- set if JPedal should estimate paragraph spacing in output.
- errorTracker- a custom error tracker
- Throws:
- PdfException- if problem with parsing and extracting text from PDF file
 
- 
writeAllTextToDir
public static void writeAllTextToDir(String inputDir, String password, String outputDir, int maxPages) throws PdfException
Convenience method to write all the text in a directory of PDF files
- Parameters:
- inputDir- directory containing PDF files
- password- user or owner password for PDF files
- outputDir- directory for writing out wordlists
- maxPages- limit to just the first maxPages of a document
- Throws:
- PdfException- if problem with parsing and extracting text from PDF file
 
- 
writeAllTextToDir
public static void writeAllTextToDir(String inputDir, String outputDir, int maxPages) throws PdfException
Convenience method to write all the text in a directory of PDF files
- Parameters:
- inputDir- directory containing PDF files
- outputDir- directory for writing out wordlists
- maxPages- limit to just the first maxPages of a document
- Throws:
- PdfException- if problem with parsing and extracting text from PDF file
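
The `maxPages` parameter is described as limiting output to the first `maxPages` of a document. A minimal sketch of that rule, assuming 1-based page numbers and a positive `maxPages`, is shown below. `MaxPagesSketch` and `pagesToProcess` are hypothetical names, and how JPedal treats zero or negative values is not stated on this page.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class MaxPagesSketch {

    /**
     * Returns the 1-based page numbers that fall within the first
     * maxPages of a document with pageCount pages.
     */
    public static int[] pagesToProcess(int pageCount, int maxPages) {
        int lastPage = Math.min(pageCount, maxPages);
        // rangeClosed yields an empty range when lastPage is below 1.
        return IntStream.rangeClosed(1, lastPage).toArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(pagesToProcess(10, 3))); // prints [1, 2, 3]
        System.out.println(Arrays.toString(pagesToProcess(2, 5)));  // prints [1, 2]
    }
}
```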
 
- 
setPassword
public void setPassword(String password)
- Parameters:
- password- the USER or OWNER password for the PDF file
 
- 
getPageCount
public int getPageCount()
Number of pages in the PDF file (page numbering starts at 1)
- Returns:
- page count
 
 
-