Subscribe to calebj's Blog        RSS Feed

Crawler is going to be getting a backend

Lately I have been making a rather large push to do three things: index as much of the web as I can with one 50mb/s connection, parse the content and make it searchable, and create a nice back-end for these operations. Thus far the project is going quite nicely. I had to move away from a hosting provider and start hosting the crawler from home after crashing one of their shared hosting servers (whoops). Since then I have indexed quite a few websites, and now I just need to go through the data.

There isn't really a purpose to this project; it's more of a "because I can" kind of thing. Still, I am here to share some statistics with you.

I currently have:
44.5 million pages in the queue.
4.3 million distinct URIs (including parameter arguments).
317 thousand hosts in the queue.
2.4 million websites stored locally.
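
For anyone curious how counts like these could be tallied, here is a rough sketch in Python. It is not the crawler's actual code: the function name `queue_stats` and the sample URLs are made up for illustration, and it assumes two URIs are distinct whenever their query strings differ (fragments are ignored).

```python
from urllib.parse import urlsplit


def queue_stats(urls):
    """Return (total pages, distinct URIs, distinct hosts) for an iterable of URLs.

    Hypothetical helper: a URI is treated as distinct if its scheme, host,
    path, or query string differs; the #fragment is dropped before comparing.
    """
    total = 0
    uris = set()
    hosts = set()
    for url in urls:
        total += 1
        parts = urlsplit(url)
        # Keep the query string ("parameter arguments"), drop the fragment.
        uris.add(parts._replace(fragment="").geturl())
        # hostname may be None for malformed URLs; skip those for host counts.
        host = (parts.hostname or "").lower()
        if host:
            hosts.add(host)
    return total, len(uris), len(hosts)


if __name__ == "__main__":
    sample = [
        "http://a.example/x?p=1",
        "http://a.example/x?p=2",
        "http://a.example/x?p=1#top",
        "http://B.example/",
    ]
    print(queue_stats(sample))
```

At the scale of tens of millions of URLs, in-memory sets like these would likely need to be replaced by a database or an on-disk structure.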

In all of this data I have:
Discovered that 10.7 thousand of the sites are adult-based or mention adult-based content.
Discovered that 1.6 thousand are using the HTML5 doctype.
Discovered that one must not crawl Twitter, because it makes up 11.3% of the data stored on the server.
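
Two of these findings are easy to illustrate. The sketch below is only a guess at how such checks might look, not the code behind the numbers above: `uses_html5_doctype` and `host_share` are hypothetical names, the doctype test only accepts the plain `<!DOCTYPE html>` form at the very start of the page, and the byte sizes in the example are invented.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Simplification: matches only the short HTML5 doctype, optionally preceded
# by whitespace. Pages with a leading comment or BOM would be missed.
HTML5_DOCTYPE = re.compile(r"\s*<!doctype\s+html\s*>", re.IGNORECASE)


def uses_html5_doctype(html):
    """Return True if the page text starts with <!DOCTYPE html>."""
    return HTML5_DOCTYPE.match(html) is not None


def host_share(sizes_by_url):
    """Map each host to its percentage of the total stored bytes.

    sizes_by_url: dict of URL -> stored size in bytes (assumed input shape).
    """
    totals = Counter()
    for url, size in sizes_by_url.items():
        host = (urlsplit(url).hostname or "").lower()
        totals[host] += size
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {host: 100.0 * size / grand_total for host, size in totals.items()}


if __name__ == "__main__":
    print(uses_html5_doctype("<!DOCTYPE html>\n<html></html>"))
    print(host_share({"http://twitter.com/a": 30, "http://example.com/": 70}))
```

A per-host breakdown like this is one way to spot a single site dominating storage before it eats the disk.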

This is what I come home to:
Posted Image
