11 Replies - 1113 Views - Last Post: 14 April 2015 - 12:38 PM

#1 dstin44

  • New D.I.C Head

Reputation: 0
  • Posts: 6
  • Joined: 13-April 15

What may be the best way to scrape title tags from large websites?

Posted 13 April 2015 - 02:28 AM

I am working on a task of scraping title tags (keywords) from sites with 25 million or up to 50 million pages each. It could be 5 sites, but the number of domains could go up to 20 or more (probably not). I've been researching this topic for quite some time, and I am not sure what the best solution would be. People recommend Java (Seo Spider by ScreamingFrog.co.uk and Xenu's Link Sleuth are two programs for this, both written in Java, but neither is good for sites of this size), PHP, a script on a good Linux server, the Google Chrome console, and add-ons for Chrome and also Firefox (I am not sure if browsers would be good for this).

In general, there is hardware, programming languages, and the websites / Internet themselves, and I am still wondering what the best way to approach this task would be. I know that good hardware may be useful too: a brand new PC, let's say, with Windows 10, a good 8-core processor (I am not sure if this part is necessary), 32GB+ of RAM, an SSD (I know that this one is needed), and a fast / top-speed Internet connection.

What would you recommend as far as researching and working on this task? What would be the best way to go in terms of speed, price, and results?

Thanks.


Replies To: What may be the best way to scrape title tags from large websites?

#2 frazereastm

  • D.I.C Head

Reputation: 7
  • Posts: 88
  • Joined: 03-December 14

Re: What may be the best way to scrape title tags from large websites?

Posted 13 April 2015 - 03:03 AM

Hi, this topic is in the wrong place on the forum, and I'm sure it will shortly be moved to a forum where you will receive the help you need.

#3 ArtificialSoldier

  • D.I.C Lover

Reputation: 1842
  • Posts: 5,793
  • Joined: 15-January 14

Re: What may be the best way to scrape title tags from large websites?

Posted 13 April 2015 - 10:37 AM

I'll move this to software development, because the language you'll use will probably be a generic high-level language rather than something like PHP or another web-development language.

In general, you're talking about a spider that follows the various links on pages. A commercial spider application will probably allow you to write code to handle the "payload" — what to do for each page (which in your case would be extracting certain tags). As far as hardware goes, the general rules apply: the faster the better. This is a CPU-intensive application, so that's where you'll get the most bang for your buck. The real bottleneck, though, is going to be network throughput.
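The spider-plus-payload idea can be sketched with nothing but Python's standard library. This is a minimal illustration, not production code: the class and function names are made up, there is no robots.txt handling or rate limiting, and a crawl of millions of pages would need persistence and parallelism on top of it.

```python
# Minimal breadth-first spider sketch: fetch a page, run the "payload"
# (here: grab the <title>), then follow same-host links.
# All names and any seed URL passed in are illustrative placeholders.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class PageParser(HTMLParser):
    """Collects the page title and all href links in one pass."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def crawl(seed, max_pages=100):
    """Follow same-host links from `seed`, mapping URL -> title."""
    host = urlparse(seed).netloc
    queue, seen, titles = deque([seed]), {seed}, {}
    while queue and len(titles) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip unreachable pages
        parser = PageParser()
        parser.feed(html)
        titles[url] = parser.title.strip()  # the "payload" step
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return titles
```

Calling `crawl("http://example.com")` would return a dict of URL-to-title pairs for pages reachable by links, which also answers the question below: a link-following spider only finds pages that are linked from somewhere.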

#4 dstin44

  • New D.I.C Head

Reputation: 0
  • Posts: 6
  • Joined: 13-April 15

Re: What may be the best way to scrape title tags from large websites?

Posted 14 April 2015 - 12:35 AM

I am still wondering how this would work as far as mapping the structure of the sites and getting all the pages. You mentioned following links on the pages of a site. Would that mean a page needs to be linked to from some other page of the website in order to be found, or not necessarily? I know that there are sitemap programs and websites out there which create sitemaps, as the name implies. I think there may be a way to get 100% of the pages of a domain, but I am not sure.

You also mentioned a high-level programming language. Would this be something like C++? Also, I won't have thousands of dollars for it; I am trying to find a way to do it relatively cheaply too. However, the most important thing would be to know how to do it and have the information, so I can go from there.

Thanks.

This post has been edited by Skydiver: 14 April 2015 - 05:53 AM
Reason for edit:: Removed unnecessary quote. No need to quote the message above yours.


#5 cfoley

  • Cabbage

Reputation: 2386
  • Posts: 5,009
  • Joined: 11-December 07

Re: What may be the best way to scrape title tags from large websites?

Posted 14 April 2015 - 01:03 AM

Have you tried just using one of the out-of-the-box solutions you suggested? If you have, what issues did you run into? Maybe we can help troubleshoot. If not, it's a good place to start. Often you'll find that you don't need to make anything new.

#6 dstin44

  • New D.I.C Head

Reputation: 0
  • Posts: 6
  • Joined: 13-April 15

Re: What may be the best way to scrape title tags from large websites?

Posted 14 April 2015 - 01:35 AM

Two programs for it are:

Seo Spider by ScreamingFrog.co.uk, and Xenu's Link Sleuth. The people behind Seo Spider told me that it may not be / is not good for websites of this size, and that they are working on a new version which will work differently and will be better for this.

Xenu's Link Sleuth sounds like it should work, from what people are saying, but it is not perfect either. The program is old (although it looks like it was last updated in 2013 - http://prntscr.com/6tjpz0). For Xenu, good hardware should improve things, but I am not sure. It still looks like it would be far from able to scrape websites with 25 million pages or more. One million pages should be doable fairly easily with good hardware, but more than that seems to be impossible or very hard. The author of the program recommended saving often by hand, but I am not able to work with this program on my PC (and it is not bad: a 3.40 GHz processor and 8GB of RAM).

I think that the new version of Seo Spider plus good hardware and a fast Internet connection would do the trick, but that program may not be ready for another 10 or so months, if not longer.

This post has been edited by Skydiver: 14 April 2015 - 05:53 AM
Reason for edit:: Removed unnecessary quote. No need to quote the message above yours.


#7 dstin44

  • New D.I.C Head

Reputation: 0
  • Posts: 6
  • Joined: 13-April 15

Re: What may be the best way to scrape title tags from large websites?

Posted 14 April 2015 - 02:07 AM

http://moz.com/commu...temap-generator

I am not able to edit the previous reply, but according to the link (that is my post / question), people seem to be getting around the limitations of these two programs. AuditMyPc.com Sitemap Generator seems like a very good tool too; it is fast. It starts hanging after a certain number of URLs, but this may be something that can be fixed. With a complete list of URLs, I could scrape the title tags with something like Scrapebox.

#8 cfoley

  • Cabbage

Reputation: 2386
  • Posts: 5,009
  • Joined: 11-December 07

Re: What may be the best way to scrape title tags from large websites?

Posted 14 April 2015 - 02:34 AM

What I really meant was: what happens when you try one?

Quote

People from Seo Spider told me that it may be / it is not good for websites of this size


This sounds vague and non-committal. Maybe it won't work at all but maybe it'll just be a little slower. Download the thing, point it at a website and see what happens.

Then download a couple of others, point them at something and see what happens.

Then you'll know what the real issues are for your use case.



If I were designing something to work at a large scale, I would want it to be fault tolerant. That means if it crashes, I'd want to be able to start from where I left off. I'd want text files (or maybe databases) listing all the pages still to download and all the pages already downloaded. I would want somewhere to store pages that have been downloaded but not yet processed (maybe temporarily), and I'd want a database relating each page URL to its keywords, and anything else you need. Maybe there are more intermediate stages that should be output.

Storing partial results like this means that you don't need one big program to do it all. You can have something to download pages, something else to extract the keywords and links, something else to put them in the database, and so on. Maybe a lot of these parts already exist.

As well as being able to restart, you'll be able to see what goes wrong. You'll be able to replace poorly performing parts with something better, and if you decide you need a distributed system, it'll be a lot easier to split things up and share work around.

But the first thing is to try existing solutions and see what happens.
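The restartable design described above can be sketched like this, assuming plain text files for the pending and finished URL sets. The file names and function names are illustrative, not a fixed convention:

```python
# Sketch of a restartable crawl state: keep the frontier and the set of
# finished URLs in plain text files, so a crashed run can pick up where
# it left off. File and function names here are made up for illustration.
import os

TODO_FILE = "todo.txt"
DONE_FILE = "done.txt"

def load_state():
    """Read pending and completed URLs from disk (empty on first run)."""
    def read(path):
        if not os.path.exists(path):
            return []
        with open(path, encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]
    done = set(read(DONE_FILE))
    todo = [u for u in read(TODO_FILE) if u not in done]
    return todo, done

def mark_done(url):
    """Append immediately, so progress survives a crash."""
    with open(DONE_FILE, "a", encoding="utf-8") as f:
        f.write(url + "\n")

def add_todo(urls):
    """Queue newly discovered URLs for a later (or restarted) run."""
    with open(TODO_FILE, "a", encoding="utf-8") as f:
        for u in urls:
            f.write(u + "\n")
```

Because each stage's output lives on disk, the downloader, the tag extractor, and the database loader can be separate small programs, exactly as suggested above.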

#9 modi123_1

  • Suitor #2

Reputation: 13566
  • Posts: 54,125
  • Joined: 12-June 08

Re: What may be the best way to scrape title tags from large websites?

Posted 14 April 2015 - 06:48 AM

Just to make a note - those apps you mentioned are typically for *your* site and *your* content. Bastardizing them into slashing through 'millions of pages' for some sort of blackhat SEO business does not seem on the up and up.

#10 dstin44

  • New D.I.C Head

Reputation: 0
  • Posts: 6
  • Joined: 13-April 15

Re: What may be the best way to scrape title tags from large websites?

Posted 14 April 2015 - 11:44 AM

View Postcfoley, on 14 April 2015 - 02:34 AM, said:

But the first thing is to try existing solutions and see what happens.


I tried the existing solutions, and they did not work. I was able to talk directly to the authors / owners after that. The people from Screaming Frog (Seo Spider) told me that this is not a good tool for sites of this size, and that they will have a better version coming out which will not store everything in RAM, but on the hard drive. The person from Xenu's Link Sleuth told me to save often; he did not really tell me that it is not good for large websites, or maybe he did, I don't remember. I think that good hardware (a lot of RAM, an SSD) may bring a lot of improvement when working with Xenu's Link Sleuth.

Scraping all the title tags, or getting all the URLs, of large websites is definitely not an easy task at this point. I've been researching it for almost a month now. It should be very doable, it seems to me; everything can be done programmatically, computer-wise, I would say, in general.

#11 Martyr2

  • Programming Theoretician

Reputation: 5078
  • Posts: 13,707
  • Joined: 18-April 07

Re: What may be the best way to scrape title tags from large websites?

Posted 14 April 2015 - 12:11 PM

You do realize you can create a small program which makes calls out to pages and filters them, putting them in a queue or a DB or whatever, all with a few TCP connections, right? Writing it in C or something close to the metal will give you the best performance out of the language. Then you bump up the RAM and processor power and let it fly. As already mentioned, the trick here is the I/O. You could also put this on a cloud like Amazon or Google and get good throughput, plus have multiple instances working in parallel.

So I am not sure if you are here to learn how to write such a program or to shop around for the best existing one. If you are just here to shop around, compare platforms, and find recommendations, you might want to save your time. If you are interested in building the software, then maybe you can ask programming-specific questions.

#12 dstin44

  • New D.I.C Head

Reputation: 0
  • Posts: 6
  • Joined: 13-April 15

Re: What may be the best way to scrape title tags from large websites?

Posted 14 April 2015 - 12:38 PM

View PostMartyr2, on 14 April 2015 - 12:11 PM, said:

You do realize you can create a small program which makes calls out to pages and filters them, putting them in a queue or a DB or whatever, all with a few TCP connections, right? Writing it in C or something close to the metal will give you the best performance out of the language. Then you bump up the RAM and processor power and let it fly. As already mentioned, the trick here is the I/O. You could also put this on a cloud like Amazon or Google and get good throughput, plus have multiple instances working in parallel.

So I am not sure if you are here to learn how to write such a program or to shop around for the best existing one. If you are just here to shop around, compare platforms, and find recommendations, you might want to save your time. If you are interested in building the software, then maybe you can ask programming-specific questions.


This sounds good. Somebody already mentioned Amazon; I was thinking about that. Would you be able to create such a program? How much would it cost?

Thanks.
