Having trouble converting pdf to excel using spire.xls for .NET !?

  • (3 Pages)
  • +
  • 1
  • 2
  • 3

39 Replies - 1307 Views - Last Post: 13 September 2018 - 03:54 PM Rate Topic: ****- 1 Votes

#1 TBPU   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 11
  • Joined: 01-September 18

Having trouble converting pdf to excel using spire.xls for .NET !?

Posted 01 September 2018 - 07:15 AM

I'm trying to convert a pdf file to a xlsx file using Spire.XLS for .NET.
sample pdf
wanted output xlsx
The coding I've done is below
//ocr the pdf file and save it as text file and do some replaces

PdfDocument doc = new PdfDocument();
doc.LoadFromFile(@"D:\sample.pdf");
StringBuilder buffer = new StringBuilder();
foreach (PdfPageBase page in doc.Pages)
{
    buffer.Append(page.ExtractText());
}
doc.Close();
String fileName = @"D:\tempchk.txt";
String fileName2 = @"D:\tempchk.csv";
String fileName3 = @"D:\tempchk.xlsx";
File.WriteAllText(fileName, buffer.ToString());
string content=File.ReadAllText(fileName);
content=content.Trim();
content=content.Replace("Evaluation Warning : The document was created with Spire.PDF for .NET.","");
content=Regex.Replace(content,@"[^\S\r\n][^\S\r\n]+",";");
content=Regex.Replace(content,@";(\d+)","$1");
content=content.Replace(";id no","id no");

//save the text file as a csv file

File.WriteAllText(fileName2,content);
Workbook book = new Workbook();
book.LoadFromFile(fileName2, ",", 1, 1);

//converting csv to xlsx

book.SaveToFile(fileName3, ExcelVersion.Version2013);
book.LoadFromFile(fileName3);
Worksheet sheet = book.Worksheets[0];

//deleting blank rows

for (int i = sheet.Rows.Count() - 1; i >= 0; i--)
{
    if (sheet.Rows[i].IsBlank)
    {
        sheet.DeleteRow(i+1);
    }
}

string[] splitText = null;
string text = null;
for (int i = 0; i < sheet.LastRow; i++)
{
    text = sheet.Range[i + 1, 1].Text;
    splitText = text.Split(';');
    for (int j = 0; j < splitText.Length; j++)
    {
        sheet.Range[i + 1, j + 1].Text = splitText[j];
    }
}
book.SaveToFile(@"D:\result.xlsx", ExcelVersion.Version2010);
System.Diagnostics.Process.Start(@"D:\result.xlsx");

But output xlsx file produced by the above code is this

I think I'm doing something wrong in the regex replace where I'm trying to replace the separator whitespaces with semicolon! But cannot figure out a way to resolve it. How do I get the desired result?
Also, is there a way to make my code more efficient?

Is This A Good Question/Topic? 0
  • +

Replies To: Having trouble converting pdf to excel using spire.xls for .NET !?

#2 Skydiver   User is online

  • Code herder
  • member icon

Reputation: 6560
  • View blog
  • Posts: 22,517
  • Joined: 05-May 12

Re: Having trouble converting pdf to excel using spire.xls for .NET !?

Posted 01 September 2018 - 08:20 AM

Where are you trying to replace the separator white spaces with a semicolon?

If it's on line 18, then the caret in the regular expression on line 18 means "not". See character classes.

What is your desired result?

To make your testing easier why not simply set a breakpoint after line 18 or inspect your tempchk.csv file instead of waiting for the full conversion.

As an aside, why all the intermediate writes to external files and then loading back up again only to write back out again?
Was This Post Helpful? 0
  • +
  • -

#3 TBPU   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 11
  • Joined: 01-September 18

Re: Having trouble converting pdf to excel using spire.xls for .NET !?

Posted 01 September 2018 - 08:29 AM

Quote

Where are you trying to replace the separator white spaces with a semicolon?

I'm trying to replace the separator white spaces in the text file D:\tempchk.txt

Quote

What is your desired result?

desired xlsx file
Was This Post Helpful? 0
  • +
  • -

#4 Sheepings   User is offline

  • Senior Programmer
  • member icon

Reputation: 149
  • View blog
  • Posts: 962
  • Joined: 05-December 13

Re: Having trouble converting pdf to excel using spire.xls for .NET !?

Posted 01 September 2018 - 09:34 AM

View PostSkydiver, on 01 September 2018 - 03:20 PM, said:

If it's on line 18, then the caret in the regular expression on line 18 means "not". See character classes.

Your link was bad sky
Fixed link https://docs.microso...aracter-classes
Was This Post Helpful? 1
  • +
  • -

#5 Skydiver   User is online

  • Code herder
  • member icon

Reputation: 6560
  • View blog
  • Posts: 22,517
  • Joined: 05-May 12

Re: Having trouble converting pdf to excel using spire.xls for .NET !?

Posted 01 September 2018 - 10:24 AM

Sheepings: Thanks!

TBPU: Can you post a copy of tempchk.txt?
Was This Post Helpful? 0
  • +
  • -

#6 TBPU   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 11
  • Joined: 01-September 18

Re: Having trouble converting pdf to excel using spire.xls for .NET !?

Posted 01 September 2018 - 05:41 PM

Skydiver: Here is the tempchk.txt
Was This Post Helpful? 0
  • +
  • -

#7 Skydiver   User is online

  • Code herder
  • member icon

Reputation: 6560
  • View blog
  • Posts: 22,517
  • Joined: 05-May 12

Re: Having trouble converting pdf to excel using spire.xls for .NET !?

Posted 01 September 2018 - 06:48 PM

Seems easy enough to get to your final desired .xlsx with less file back and forth. Simply delete leading whitespace on each line. Within each line, replace multiple adjacent whitespace with commas. Only write out non-empty lines to the .csv. Convert .csv to .xlsx. Open your beverage of choice and call it a day.
Was This Post Helpful? 0
  • +
  • -

#8 TBPU   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 11
  • Joined: 01-September 18

Re: Having trouble converting pdf to excel using spire.xls for .NET !?

Posted 01 September 2018 - 07:08 PM

I've tried Trim method but it does not remove the leading white spaces for some reason and I'm also struggling to replace multiple adjacent whitespace with commas.

Can you show me some code please?

This post has been edited by Skydiver: 01 September 2018 - 07:22 PM
Reason for edit:: Removed unnecessary quote. No need to quote the post above yours.

Was This Post Helpful? 0
  • +
  • -

#9 Skydiver   User is online

  • Code herder
  • member icon

Reputation: 6560
  • View blog
  • Posts: 22,517
  • Joined: 05-May 12

Re: Having trouble converting pdf to excel using spire.xls for .NET !?

Posted 01 September 2018 - 07:33 PM

There is no need to quote the post above yours. Just use the big Reply button or the Fast Reply area.

You need to trim on each line. Trim() works on one string at a time. It knows nothing about line breaks within a string. The problem is that you are using ReadAllText() which loads all of the lines of text as an single string. You need to use ReadAllLines() or ReadLines() which returns an array or enumeration of lines where each line is a string. Then as I stated, work on each line one at a time.

Please be cautious about asking for code. That is a quick way to get your thread closed.
Was This Post Helpful? 1
  • +
  • -

#10 Skydiver   User is online

  • Code herder
  • member icon

Reputation: 6560
  • View blog
  • Posts: 22,517
  • Joined: 05-May 12

Re: Having trouble converting pdf to excel using spire.xls for .NET !?

Posted 01 September 2018 - 07:44 PM

As for replacing multiple adjacent spaces, experiment with something like RegExPad.com to see how your expression is matching (or failing to match) various strings.

I personally found learning through experimentation works best for regular expressions instead of being spoonfed an expression, or looking up a regex from some recipe book.

Also, you really don't need regular expressions for this.
Was This Post Helpful? 0
  • +
  • -

#11 TBPU   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 11
  • Joined: 01-September 18

Re: Having trouble converting pdf to excel using spire.xls for .NET !?

Posted 01 September 2018 - 09:44 PM

Hi skydiver, thanks for your suggestions.
This is the latest that I've done
String fileName = @"D:\tempchk.txt";
String fileName2 = @"D:\tempchk.csv";
String fileName3 = @"D:\tempchk.xlsx";

PdfDocument doc = new PdfDocument();
doc.LoadFromFile(@"D:\sample.pdf");

StringBuilder buffer = new StringBuilder();

foreach (PdfPageBase page in doc.Pages)
{
	buffer.Append(page.ExtractText());
}

doc.Close();
File.WriteAllText(fileName, buffer.ToString());

var lines = File.ReadLines(fileName);
List<string> newlines=new List<string>();
foreach (var line in lines)
{
	if (!string.IsNullOrEmpty(line))
	{
		var x = line.Trim();
		x = Regex.Replace(x, @"\s\s+", ";");
		newlines.Add(x);
	}
}
File.WriteAllLines(fileName2, newlines);

Workbook book = new Workbook();

book.LoadFromFile(fileName2, ",", 1, 1);

book.SaveToFile(fileName3, ExcelVersion.Version2013);


book.LoadFromFile(fileName3);
Worksheet sheet = book.Worksheets[0];

for (int i = sheet.Rows.Count() - 1; i >= 0; i--)
{
	if (sheet.Rows[i].IsBlank)
	{
		sheet.DeleteRow(i+1);
	}
}

string[] splitText = null;
string text = null;
for (int i = 0; i < sheet.LastRow; i++)
{
	text = sheet.Range[i + 1, 1].Text;
	splitText = text.Split(';');
	for (int j = 0; j < splitText.Length; j++)
	{
		sheet.Range[i + 1, j + 1].Text = splitText[j];
	}
}
sheet.AllocatedRange.AutoFitColumns();
sheet.AllocatedRange.AutoFitRows();

book.SaveToFile(@"D:\result.xlsx", ExcelVersion.Version2010);
File.Delete(fileName);
File.Delete(fileName2);
File.Delete(fileName3);
System.Diagnostics.Process.Start(@"D:\result.xlsx");

As of now, its seems to work fine but what other changes can I make it to be more efficient?
Was This Post Helpful? 0
  • +
  • -

#12 Skydiver   User is online

  • Code herder
  • member icon

Reputation: 6560
  • View blog
  • Posts: 22,517
  • Joined: 05-May 12

Re: Having trouble converting pdf to excel using spire.xls for .NET !?

Posted 02 September 2018 - 05:42 AM

Why are you replacing the multiple spaces with semicolons only to later to split any cells you find with semicolons into mutiple cells? Doing this against either the real Excel object model, or Spire.xls's object model will be slow. As I said above, just replace the multiple spaces with commas so that it is already split into multiple cells.

As I as also noted, why are you doing multiple rounds trips to the filesystem and back again (e.g. Load as text and save as .csv, then load a csv and save to 2013, and then load 2013 and save as 2010?) Hitting the filesystem is slow. Minimize these round-trips.

And lastly, I also noted that you really don't need regular expressions for this. Regular expressions are slow because what happens in the background is that the .NET Framework takes your regular expression string and generates a finite state machine in C#, then compiles that C# into a temporary assembly, then loads that assembly, and then finally it can perform the regular expression matching. That temporary assembly is only cached for the runtime of your program. The next time you run, it will have to do all that work again. If you simply take time to write the code to search for multiple spaces and replace it with commas, that code will be compiled and ready to run right away.
Was This Post Helpful? 0
  • +
  • -

#13 TBPU   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 11
  • Joined: 01-September 18

Re: Having trouble converting pdf to excel using spire.xls for .NET !?

Posted 02 September 2018 - 05:56 AM

How do I search for multiple spaces and replace it with commas without regex? The count of spaces between all the columns vary... :dontgetit:
Was This Post Helpful? 0
  • +
  • -

#14 Sheepings   User is offline

  • Senior Programmer
  • member icon

Reputation: 149
  • View blog
  • Posts: 962
  • Joined: 05-December 13

Re: Having trouble converting pdf to excel using spire.xls for .NET !?

Posted 02 September 2018 - 06:36 AM

View PostSheepings, on 01 September 2018 - 04:34 PM, said:


Did you read this? TBPU

View PostTBPU, on 02 September 2018 - 02:08 AM, said:

I've tried Trim method but it does not remove the leading white spaces for some reason and I'm also struggling to replace multiple adjacent whitespace with commas.

The solution was already explained to you. You're not listening to him:

View PostTBPU, on 02 September 2018 - 12:56 PM, said:

How do I search for multiple spaces and replace it with commas without regex? The count of spaces between all the columns vary... :dontgetit:/>/>

And:

View PostSkydiver, on 02 September 2018 - 12:42 PM, said:

Why are you replacing the multiple spaces with semicolons only to later to split any cells you find with semicolons into mutiple cells? Doing this against either the real Excel object model, or Spire.xls's object model will be slow. As I said above, just replace the multiple spaces with commas so that it is already split into multiple cells.


Skydiver - I would advocate using regex as a preference for something like this as its quicker and simpler. Curious why wouldn't regex be better? :S

This post has been edited by Sheepings: 02 September 2018 - 07:25 AM

Was This Post Helpful? 0
  • +
  • -

#15 TBPU   User is offline

  • New D.I.C Head

Reputation: 0
  • View blog
  • Posts: 11
  • Joined: 01-September 18

Re: Having trouble converting pdf to excel using spire.xls for .NET !?

Posted 02 September 2018 - 07:06 AM

Hi, Sheepings
The link you provided does not open :dontgetit:
Was This Post Helpful? 0
  • +
  • -

  • (3 Pages)
  • +
  • 1
  • 2
  • 3