All pages PDF stripper

19 Replies - 736 Views - Last Post: 30 May 2020 - 11:47 AM

#1 leace

  • New D.I.C Head

Reputation: 0
  • Posts: 18
  • Joined: 26-May 20

Posted 26 May 2020 - 02:40 PM

I am extracting text from a PDF file. At the moment only the first page's data comes through; the data from the other pages is missing.

What code is missing to get the details from the other pages?


public class PDFBoxReadFromFile {


  public static void main(String[] args) throws Exception {

    try (PDDocument document = PDDocument.load(new File("C:\\Users\\ed\\Documents\\test2.pdf"))) {

      if (!document.isEncrypted()) {
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.setSortByPosition(true);
        Rectangle2D rect4 = new Rectangle2D.Double(210, 160, 230, 25);
        Rectangle rect1 = new Rectangle(55, 290, 225, 17);

        Rectangle2D rect2 = new Rectangle2D.Double(281, 255, 255, 20);
        Rectangle2D rect3 = new Rectangle2D.Double(2, 365, 660, 1900);
        stripper.addRegion("class2", rect1);
        stripper.addRegion("class5", rect4);
        PDPage firstPage = document.getPages().get(0);
        stripper.extractRegions(firstPage);
        System.out.print(stripper.getTextForRegion("class5"));
        System.out.print(stripper.getTextForRegion("class2"));


        File file = new File("C:/Users/ed/eclipse-workspace/pdfboxreadfromfile/file.txt");
        FileWriter fw = new FileWriter(file);
        PrintWriter pw = new PrintWriter(fw);
        pw.println(stripper.getTextForRegion("class5"));
        pw.println(stripper.getTextForRegion("class2"));

        pw.close();

      }
    } catch (IOException e) {
      System.err.println("Exception while trying to read pdf document - " + e);
    }
  }
}




Replies To: All pages PDF stripper

#2 NormR

  • D.I.C Lover

Reputation: 834
  • Posts: 6,402
  • Joined: 25-December 13

Posted 26 May 2020 - 02:43 PM

Quote

What code is missing

Can you post the API doc for the packages and classes you are using? I imagine what you are looking for will be discussed in the API doc.

Also what are the import statements for the posted code?

This post has been edited by NormR: 26 May 2020 - 02:44 PM


#3 leace

Posted 26 May 2020 - 02:49 PM

I am using PDFBox to extract the text from a PDF document. I have not been given any API doc for it. Everything so far has come from googling online :)

If anyone can help with reading the other pages, that would be great.

#4 NormR

Posted 26 May 2020 - 02:54 PM

What are the import statements for the posted code?
Is there a jar file with the classes?

Where did it come from?

#5 leace

Posted 26 May 2020 - 02:57 PM

The import statements are below.
The jar file is "pdfbox-app-2.0.19.jar".
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.PDFTextStripperByArea;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.io.*;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.util.Matrix;
import java.awt.Color;
import java.awt.Dimension;
import java.awt.Point;
import java.awt.Rectangle;
import java.awt.geom.AffineTransform;
import java.io.FileNotFoundException;
import java.util.Map;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import javax.xml.transform.dom.DOMSource;
import org.w3c.dom.Document;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import java.awt.*;
import java.awt.geom.Rectangle2D;



#6 NormR

Posted 26 May 2020 - 03:49 PM

Here is a link to the API doc: https://pdfbox.apach...2.0.0/javadocs/
Read that to see what is available for solving your problem.

#7 g00se

  • D.I.C Lover

Reputation: 3702
  • Posts: 16,962
  • Joined: 20-September 08

Posted 27 May 2020 - 04:52 AM

Quote

At the moment only the first page's data comes through; the data from the other pages is missing.

Because you only ask for the first page:

Quote

PDPage firstPage = document.getPages().get(0);


#8 leace

Posted 27 May 2020 - 08:33 AM

Do you have code for all the pages? Please share.

#9 NormR

Posted 27 May 2020 - 08:49 AM

Quote

code for all the pages

Write a loop that changes the value passed to the get method from 0 to the number of pages.

#10 leace

Posted 27 May 2020 - 09:00 AM

Thank you. Can you share code with a loop that changes the value passed to the get method from 0 to 6 pages?

That would be helpful.

#11 modi123_1

  • Suitor #2

Reputation: 15743
  • Posts: 63,068
  • Joined: 12-June 08

Posted 27 May 2020 - 09:02 AM

Please be cognizant about the rules on asking people to do your work for you.

Try writing that loop first!

#12 leace

Posted 27 May 2020 - 09:25 AM

NormR, on 27 May 2020 - 08:49 AM, said:

Quote

code for all the pages

Write a loop that changes the value passed to the get method from 0 to the number of pages.


I tried the following, but no luck.

        PDPageTree allPages = document.getDocumentCatalog().getPages();

        for (int i = 0; i < allPages.getCount(); i++) {
            PDPage page = allPages.get(i);
            stripper.extractRegions(page);
        }


#13 NormR

Posted 27 May 2020 - 09:33 AM

Can you post all of the code so we can see where those few lines fit in with the whole program?

#14 leace

Posted 27 May 2020 - 09:53 AM

Full code below.


package pdfboxreadfromfile;

import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.PDFTextStripperByArea;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.io.*;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.util.Matrix;
import java.awt.Color;
import java.awt.Dimension;
import java.awt.Point;
import java.awt.Rectangle;
import java.awt.geom.AffineTransform;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Map;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import javax.xml.transform.dom.DOMSource;

import org.w3c.dom.Document;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;
import java.awt.*;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDPageTree;
import java.awt.geom.Rectangle2D;
public class PDFBoxReadFromFile {


  public static void main(String[] args) throws Exception {

    try (PDDocument document = PDDocument.load(new File("C:\\Users\\ed\\Documents\\test2.pdf"))) {

      if (!document.isEncrypted()) {
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.setSortByPosition(true);
        Rectangle2D rect4 = new Rectangle2D.Double(210, 160, 230, 25);
        Rectangle rect1 = new Rectangle(55, 290, 225, 17);

        Rectangle2D rect2 = new Rectangle2D.Double(281, 255, 255, 20);
        Rectangle2D rect3 = new Rectangle2D.Double(2, 365, 660, 1900);
        stripper.addRegion("class2", rect1);
        stripper.addRegion("class3", rect2);
        stripper.addRegion("class4", rect3);
        stripper.addRegion("class5", rect4);

        PDPageTree allPages = document.getDocumentCatalog().getPages();

        for (int i = 0; i < allPages.getCount(); i++) {
          PDPage page = allPages.get(i);
          stripper.extractRegions(page);

          PDPage firstPage = document.getPages().get(0);
          stripper.extractRegions(firstPage);
          System.out.println(stripper.getTextForRegion("class5"));
          System.out.println(stripper.getTextForRegion("class2"));
          System.out.println(stripper.getTextForRegion("class3"));
          System.out.println(stripper.getTextForRegion("class4"));

          File file = new File("C:/Users/ed/eclipse-workspace/pdfboxreadfromfile/file.txt");
          FileWriter fw = new FileWriter(file);
          PrintWriter pw = new PrintWriter(fw);
          pw.println(stripper.getTextForRegion("class5"));
          pw.println(stripper.getTextForRegion("class2"));
          pw.println(stripper.getTextForRegion("class3"));
          pw.println(stripper.getTextForRegion("class4"));
          pw.close();

        }
      }
    } catch (IOException e) {
      System.err.println("Exception while trying to read pdf document - " + e);
    }
  }
}




#15 NormR

Posted 27 May 2020 - 10:01 AM

What are lines 74 and 75 (the two firstPage statements) supposed to do? They call extractRegions on the first page again on every pass through the loop, right after the call for the current page.

Having the PrintWriter creation and close inside the loop means that each pass creates a new file and writes to it, overwriting what the previous pass wrote.
Those statements should be outside of the loop:
- Create the file
- Loop to write the data to the file
- Close the file
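
A rough sketch of that ordering, with the PDFBox calls left out so that it runs on its own. PageDump, writeAllPages and textForPage are made-up names for illustration only, not part of PDFBox; in the posted program the textForPage step would presumably be the extractRegions call for page i followed by the getTextForRegion calls. Treat it as one possible shape, not as the definitive fix.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.util.function.IntFunction;

// Sketch only: PageDump and textForPage are hypothetical names, not PDFBox API.
class PageDump {

    /**
     * Writes the text of every page to a writer that the caller has already opened.
     * Because the writer is created before the loop and closed after it,
     * text written for an earlier page is not overwritten by a later one.
     */
    static void writeAllPages(int pageCount, IntFunction<String> textForPage, Writer out) {
        PrintWriter pw = new PrintWriter(out);
        for (int i = 0; i < pageCount; i++) {
            // Assumption: with PDFBox, this is where the posted code would call
            // stripper.extractRegions(allPages.get(i)) and then read each region
            // with stripper.getTextForRegion(...).
            pw.print(textForPage.apply(i));
            pw.print('\n');
        }
        pw.flush(); // flush only; the caller owns the writer and closes it
    }

    public static void main(String[] args) throws IOException {
        // 1. Create the output once (a StringWriter here; a FileWriter for a real file).
        try (StringWriter out = new StringWriter()) {
            // 2. Loop over the pages, writing each page's text to the same writer.
            writeAllPages(3, i -> "text of page " + (i + 1), out);
            System.out.print(out);
        } // 3. The writer is closed here, after the loop has finished.
    }
}
```

The page-reading step is passed in as a function only so that the loop can be tried without a PDF; in the real program that step can simply be written inline inside the loop.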

This post has been edited by NormR: 27 May 2020 - 10:02 AM
