I've always marveled at the experts posting stuff about libraries on help sites. "You use this module's method to do this, and etc etc" + <post magical code> = awed reader. I wonder, does this come from experience, from learning the library's documentation, both, Google, or just plain intuition?

I'm ramping up my personal work project to automate testing of PDFs. It's a pain wading through tons of PDFs and checking if something changed. One problem though: some parts of the PDF will change, other parts should not. What to do?

I've decided to use Python to take this on, and I've succeeded in creating a few really helpful scripts, but the really big pain still lies ahead. Currently, it takes me at least an hour to do an extensive and rigorous check, but I imagine if I manage to do this automation thingie, the thing will be done in less than 5 minutes.

Here's what I'm looking at:

1. pyPdf way
- Yep, it's named with that case. With this route, I'll be comparing the text of the PDFs to see if some parts changed. Basically, that means lots of flags for which parts to check, plus lots of tweaking, since I'll have to set up what to ignore and what to compare for each type of PDF. Also, this does not solve the part where elements move around in the PDF. So I guess this one, while easier to implement, still won't bear much fruit since the layout cannot be checked, and I'll still be stuck checking the PDFs manually, much to the pain of my eyeballs.
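
Here's a rough sketch of how the text comparison could go. It is not tested on real PDFs. It assumes the text of each PDF has already been pulled out into a plain string (with pyPdf that would be something like `PdfFileReader(...).getPage(i).extractText()`), and the ignore patterns and function names are made up for illustration. The idea is to blank out the parts that are expected to change, then diff what's left.

```python
import difflib
import re

# Hypothetical ignore rules: patterns for parts that are EXPECTED to change.
# In practice there would be one such list per type of PDF.
IGNORE_PATTERNS = [
    re.compile(r"\d{4}-\d{2}-\d{2}"),  # dates like 2010-01-05
    re.compile(r"Page \d+ of \d+"),    # page counters
]


def normalize(text, patterns=IGNORE_PATTERNS):
    """Blank out the ignorable parts and return the non-empty lines."""
    for pattern in patterns:
        text = pattern.sub("<IGNORED>", text)
    return [line.strip() for line in text.splitlines() if line.strip()]


def changed_lines(old_text, new_text, patterns=IGNORE_PATTERNS):
    """Return removed ('- ') and added ('+ ') lines between two texts."""
    old = normalize(old_text, patterns)
    new = normalize(new_text, patterns)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        changes.extend("- " + line for line in old[i1:i2])
        changes.extend("+ " + line for line in new[j1:j2])
    return changes
```

An empty list means nothing changed outside the ignored parts. As noted above, this says nothing about layout: if the same text moves somewhere else on the page, the extracted text may still look identical.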

2. PIL (Python Imaging Library) way
- This one's looking brighter. I convert the PDFs to images first, one per page, then crop them down to the parts I want to check, then compare the crops. I'm not sure yet how long it will take for the program to run with lots of PDFs, but this one covers everything from the layout to the text.
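
A minimal sketch of the crop-and-compare step, assuming each page has already been rendered to an image by some external tool (as far as I know PIL does not render PDFs itself). The region names and pixel boxes are invented for the example, and the comparison is strictly pixel-exact, so any rendering noise would count as a change.

```python
from PIL import Image, ImageChops

# Hypothetical regions to check, as (left, upper, right, lower) pixel boxes.
REGIONS = {
    "header": (0, 0, 200, 50),
    "body": (0, 50, 200, 250),
}


def region_changed(old_page, new_page, box):
    """True if any pixel inside `box` differs between the two page images."""
    old_crop = old_page.crop(box).convert("RGB")
    new_crop = new_page.crop(box).convert("RGB")
    diff = ImageChops.difference(old_crop, new_crop)
    # getbbox() returns None when the difference image is entirely black,
    # which means the two crops are identical.
    return diff.getbbox() is not None


def changed_regions(old_page, new_page, regions=REGIONS):
    """Return the sorted names of the regions that differ."""
    return sorted(
        name
        for name, box in regions.items()
        if region_changed(old_page, new_page, box)
    )
```

With real files the pages would be loaded with `Image.open(...)`. Parts that are supposed to change can simply be left out of the region list.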

From my search for a better way to do this, I stumbled upon a blog of a game developer! How neat. Haha.

Hope I succeed in this project. ^^

