PDF to Text

Reading a PDF file in and writing the text back out.

Page 1 of 1

4 Replies - 9244 Views - Last Post: 18 September 2009 - 12:47 PM

#1 woodstock0711

PDF to Text

Posted 17 September 2009 - 04:25 PM

I have found a couple of really rudimentary C and C++ sources that convert a PDF file to a text file, but none of them has enough detail to help me in my quest. What I want to do is open a PDF file in "rb" (read binary) mode and save the text exactly the way Acrobat Reader (8.1.3) does when you select the Save As Text option. The code I found opens the PDF file and saves the text, but it does not create lines of text; it simply puts a single text value on each line. I'm sure I could eventually get it to work the way I want, but I am hoping that someone here has the knowledge or a piece of code that will shorten my coding time. If you can help, please let me know. However, I would appreciate it if only those who seriously want to help would reply.

Thank you for taking the time to read this,
Charlie -.-

Replies To: PDF to Text

#2 mono15591

Re: PDF to Text

Posted 17 September 2009 - 04:27 PM

[rules][/rules]

#3 eker676

Re: PDF to Text

Posted 18 September 2009 - 12:12 PM

Have you seen this article:
http://www.codeproje...actPDFText.aspx

To save it to a file just open up a stream and then write the text to the file.

If you wanted to you could then open the text file and print it line by line to the screen.

The example uses C, but with a few minor changes you could convert it to C++.
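
For anyone following along, the save-and-print-back step described in this reply could look roughly like the sketch below. It uses standard C++ streams rather than the C FILE* calls from the article; SaveText, ReadLines and any file name passed to them are made-up names for illustration, not part of the linked code.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: write the extracted text to a file,
// replacing whatever the file contained before.
// Returns false if the file could not be opened or written.
bool SaveText(const std::string& path, const std::string& text)
{
    std::ofstream out(path.c_str(), std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out << text;
    out.flush();
    return out.good();
}

// Hypothetical helper: read the file back one line at a time.
// A trailing '\r' is dropped so CR/LF files read cleanly everywhere.
std::vector<std::string> ReadLines(const std::string& path)
{
    std::vector<std::string> lines;
    std::ifstream in(path.c_str());
    std::string line;
    while (std::getline(in, line))
    {
        if (!line.empty() && line[line.size() - 1] == '\r')
        {
            line.erase(line.size() - 1);
        }
        lines.push_back(line);
    }
    return lines;
}
```

Calling SaveText with the extracted text and then looping over the result of ReadLines with std::cout prints the file one line at a time.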

This post has been edited by eker676: 18 September 2009 - 12:13 PM

#4 woodstock0711

Re: PDF to Text

Posted 18 September 2009 - 12:35 PM

As requested here is the code...

size_t CConvertPDFFiletoText::FindStringInBuffer (char* buffer, char* search, size_t buffersize)
{
	size_t len = strlen(search);

	//Nothing can match if the pattern is empty or longer than the buffer:
	if (len==0 || len>buffersize) return (size_t)-1;

	//Try every offset at which the whole pattern still fits in the buffer,
	//so no comparison ever reads past the end:
	for (size_t pos=0; pos+len<=buffersize; pos++)
	{
		if (memcmp(buffer+pos, search, len)==0) return pos;
	}
	return (size_t)-1;
}

//Keep this many previous recent characters for back reference:
#define oldchar 15

//Convert a recent set of characters into a number if there is one.
//Otherwise return -1:
float CConvertPDFFiletoText::ExtractNumber(const char* search, int lastcharoffset)
{
	int i = lastcharoffset;
	while (i>0 && search[i]==' ') i--;
	while (i>0 && (isdigit(search[i]) || search[i]=='.')) i--;
	float flt=-1.0;
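	char buffer[oldchar+5]; ZeroMemory(buffer,sizeof(buffer));
	//Copy the candidate digits; the size argument must be the real size of buffer:
	strncpy_s(buffer, sizeof(buffer), search+i+1, lastcharoffset-i);
	//sscanf_s returns the number of fields converted, so expect exactly 1:
	if (buffer[0] && sscanf_s(buffer, "%f", &flt)==1)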
	{
		return flt;
	}
	return -1.0;
}

//Check if a certain 2 character token just came along (e.g. BT):
bool CConvertPDFFiletoText::seen2(const char* search, char* recent)
{
if (	recent[oldchar-3]==search[0] 
	 && recent[oldchar-2]==search[1] 
	 && (recent[oldchar-1]==' ' || recent[oldchar-1]==0x0d || recent[oldchar-1]==0x0a) 
	 && (recent[oldchar-4]==' ' || recent[oldchar-4]==0x0d || recent[oldchar-4]==0x0a)
	 )
	{
		return true;
	}
	return false;
}

//This method processes an uncompressed Adobe (text) object and extracts text.
void CConvertPDFFiletoText::ProcessOutput(FILE* file, char* output, size_t len)
{
	//Are we currently inside a text object?
	bool intextobject = false;

	//Is the next character literal (e.g. \\ to get a \ character or \( to get ( ):
	bool nextliteral = false;
	
	//() Bracket nesting level. Text appears inside ()
	int rbdepth = 0;

	//Keep previous chars so numbers etc. can be extracted later:
	char oc[oldchar];
	int j=0;
	for (j=0; j<oldchar; j++) oc[j]=' ';

	for (size_t i=0; i<len; i++)
	{
		char c = output[i];
		if (intextobject)
		{
			if (rbdepth==0 && seen2("TD", oc))
			{
				//Positioning.
				//See if a new line has to start or just a tab:
				float num = ExtractNumber(oc,oldchar-5);
				if (num>1.0)
				{
					fputc(0x0d, file);
					fputc(0x0a, file);
				}
				if (num<1.0)
				{
					fputc('\t', file);
				}
			}
			if (rbdepth==0 && seen2("ET", oc))
			{
				//End of a text object, also go to a new line.
				intextobject = false;
				fputc(0x0d, file);
				fputc(0x0a, file);
			}
			else if (c=='(' && rbdepth==0 && !nextliteral) 
			{
				//Start outputting text!
				rbdepth=1;
				//See if a space or tab (>1000) is called for by looking
				//at the number in front of (
				int num = (int)ExtractNumber(oc,oldchar-1);
				if (num>0)
				{
					if (num>1000.0)
					{
						fputc('\t', file);
					}
					else if (num>100.0)
					{
						fputc(' ', file);
					}
				}
			}
			else if (c==')' && rbdepth==1 && !nextliteral) 
			{
				//Stop outputting text
				rbdepth=0;
			}
			else if (rbdepth==1) 
			{
				//Just a normal text character:
				if (c=='\\' && !nextliteral)
				{
					//Only print out next character no matter what. Do not interpret.
					nextliteral = true;
				}
				else
				{
					nextliteral = false;
					//Compare as unsigned char; a plain (signed) char is never >= 128:
					if ( ((c>=' ') && (c<='~')) || (((unsigned char)c>=128) && ((unsigned char)c<255)) )
					{
						fputc(c, file);
					}
				}
			}
		}
		//Store the recent characters for when we have to go back for a number:
		for (j=0; j<oldchar-1; j++) oc[j]=oc[j+1];
		oc[oldchar-1]=c;
		if (!intextobject)
		{
			if (seen2("BT", oc))
			{
				//Start of a text object:
				intextobject = true;
			}
		}
	}
}

//int _tmain(int argc, _TCHAR* argv[])
void CConvertPDFFiletoText::ConvertPDFFiletoText(const CString & cspThePDFFile)
{
	csPDFFile=cspThePDFFile.Left(strlen(cspThePDFFile)-4);
	csPDFFile+=".txt";

	//Discard existing output:
	FILE* fileo = fopen(csPDFFile, "w");
	if (fileo) fclose(fileo);
	fileo = fopen(csPDFFile, "a");

	//Open the PDF source file:
	FILE* filei = fopen(cspThePDFFile, "rb");

	if (filei && fileo)
	{
		//Get the file length:
		int fseekres = fseek(filei,0, SEEK_END);   //fseek==0 if ok
		long filelen = ftell(filei);
		fseekres = fseek(filei,0, SEEK_SET);

		//Read the entire file into memory (!):
		char* buffer = new char [filelen]; ZeroMemory(buffer, filelen);
		size_t actualread = fread(buffer, filelen, 1 ,filei);  //must return 1

		bool morestreams = true;

		//Now search the buffer repeatedly for streams of data:
		while (morestreams)
		{
			//Search for stream, endstream. We ought to first check the filter
			//of the object to make sure it is FlateDecode, but skip that for now!
			size_t streamstart = FindStringInBuffer (buffer, "stream", filelen);
			size_t streamend   = FindStringInBuffer (buffer, "endstream", filelen);
			//FindStringInBuffer returns (size_t)-1 when nothing is found:
			if (streamstart!=(size_t)-1 && streamend!=(size_t)-1 && streamend>streamstart)
			{
				//Skip to beginning and end of the data stream:
				streamstart += 6;
				//0x0d == carriage return, 0x0a == line feed:
				if (buffer[streamstart]==0x0d && buffer[streamstart+1]==0x0a) streamstart+=2;
				else if (buffer[streamstart]==0x0a) streamstart++;

				if (buffer[streamend-2]==0x0d && buffer[streamend-1]==0x0a) streamend-=2;
				else if (buffer[streamend-1]==0x0a) streamend--;

				//Assume output will fit into 10 times input buffer:
				size_t outsize = (streamend - streamstart)*10;
				char* output = new char [outsize]; ZeroMemory(output, outsize);

				//Now use zlib to inflate:
				z_stream zstrm; ZeroMemory(&zstrm, sizeof(zstrm));

				zstrm.avail_in = streamend - streamstart + 1;
				zstrm.avail_out = outsize;
				zstrm.next_in = (Bytef*)(buffer + streamstart);
				zstrm.next_out = (Bytef*)output;

				int rsti = inflateInit(&zstrm);
				if (rsti == Z_OK)
				{
					int rst2 = inflate (&zstrm, Z_FINISH);
					if (rst2 >= 0)
					{
						//Ok, got something, extract the text:
						size_t totout = zstrm.total_out;
						ProcessOutput(fileo, output, totout);
					}
					//Release the state zlib allocated in inflateInit:
					inflateEnd(&zstrm);
				}
				delete[] output; output=0;
				buffer+= streamend + 7;
				filelen = filelen - (streamend+7);
			}
			else
			{
				morestreams = false;
			}
		}
		fclose(filei);
	}
	if (fileo) fclose(fileo);
}



The PDF I am opening does not have the BT (Begin Text?) or ET (End Text?) symbols; I assume that's what they translate to. So it is only returning the text as single items and placing each one on a separate line. So instead of getting something like:
This is the first line of text.
This is the second line of text.
it gives me:
This

is

the

first

line

of

text.

This

is

the

second

line

of

text.

Thank you again for any help!
Charlie
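
A note on the BT/ET assumption above, offered as a guess rather than a diagnosis: PDF page content is usually stored in compressed streams, so BT and ET would not be visible when the file is opened in a text editor. ProcessOutput only writes characters after seen2 has matched BT, so the fact that words reach the output at all suggests the operators are present once the streams are inflated. One way to check is to count the operators in each inflated buffer before it goes to ProcessOutput. The sketch below does that; CountOperator and IsDelimiter are hypothetical helper names, and the code does not try to skip matches that happen to sit inside a ( ) string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: the same separators seen2 accepts, plus tab.
static bool IsDelimiter(char c)
{
    return c == ' ' || c == '\r' || c == '\n' || c == '\t';
}

// Hypothetical helper: count how often an operator such as "BT", "ET"
// or "Td" appears as a whole token in an inflated content stream.
// A match counts only when it is bounded by whitespace or by the
// start/end of the buffer.
std::size_t CountOperator(const std::string& content, const std::string& op)
{
    if (op.empty()) return 0;

    std::size_t count = 0;
    std::size_t pos = content.find(op);
    while (pos != std::string::npos)
    {
        const std::size_t end = pos + op.size();
        const bool startOk = (pos == 0) || IsDelimiter(content[pos - 1]);
        const bool endOk = (end == content.size()) || IsDelimiter(content[end]);
        if (startOk && endOk) ++count;
        pos = content.find(op, pos + 1);
    }
    return count;
}
```

For a buffer such as "BT 72 700 Td (This) Tj ET", CountOperator(content, "BT") and CountOperator(content, "ET") each give 1. A count close to the number of words on the page would point to one text object per word.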

#5 woodstock0711

Re: PDF to Text

Posted 18 September 2009 - 12:47 PM

eker676, on 18 Sep, 2009 - 11:12 AM, said:

Have you seen this article:
http://www.codeproje...actPDFText.aspx

To save it to a file just open up a stream and then write the text to the file.

If you wanted to you could then open the text file and print it line by line to the screen.

The example uses C but a few minor changes and you could convert it to C++.



As you can see from my newest post, the link you suggested is the code I am trying to get to work. However, I am looking for some help, because the code looks for the BT and ET symbols in the PDF file and the PDF file I am looking at does not use them. So I need to figure out how to get the lines of text in the PDF file without the BT and ET symbols. I am not sure how to do that with the code I provided, and I would appreciate any insight into how to accomplish it from someone who might have already found a solution to this kind of problem. What would I have to change in the code provided to have it write lines of text rather than single words of text on each line?

Thanks for the response.
Charlie
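
One possible direction for the question above, as a sketch under stated assumptions rather than a fix for the posted class: ProcessOutput writes a CR/LF pair on every ET and again on every TD whose last operand parses as greater than 1.0. If the PDF puts each word in its own text object, that would be consistent with one word per line and a blank line in between. An alternative is to collect each piece of text together with its vertical position and start a new line only when that position changes. Fragment, GroupIntoLines and the tolerance value are hypothetical names chosen for illustration; working out the y value of each fragment from the positioning operators is assumed to happen elsewhere and is not shown.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record: one run of text plus the vertical position
// it was drawn at (in whatever units the caller tracks).
struct Fragment
{
    float y;
    std::string text;
};

// Hypothetical helper: join consecutive fragments that share a baseline.
// A new line starts only when y moves by more than `tolerance`.
std::vector<std::string> GroupIntoLines(const std::vector<Fragment>& frags,
                                        float tolerance = 1.0f)
{
    std::vector<std::string> lines;
    float lineY = 0.0f;

    for (std::size_t i = 0; i < frags.size(); ++i)
    {
        const Fragment& f = frags[i];
        if (lines.empty() || std::fabs(f.y - lineY) > tolerance)
        {
            // Vertical position changed: begin a new output line.
            lines.push_back(f.text);
            lineY = f.y;
        }
        else
        {
            // Same baseline: append to the current line with a space.
            lines.back() += ' ';
            lines.back() += f.text;
        }
    }
    return lines;
}
```

Fed the fragments (700, "This"), (700, "is"), (686, "second"), it returns two lines, "This is" and "second". Each returned line can then be written to the output file followed by a single CR/LF.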
