Lately I have been making a rather large push to do three things: index as much of the web as I can over a single 50mb/s connection, parse the content and make it searchable, and build a nice back-end for these operations. Thus far the project is going quite nicely. I had to move away from my hosting provider and start running the crawler from home after crashing one of their shared hosting servers (whoops). Since then I have indexed quite a few websites, and now I just need to go through the data.
There isn't really a purpose to this project; it's more of a "because I can" kind of thing. Still, I am here to share some statistics with you.
I currently have:
44.5 million pages in the queue.
4.3 million distinct URIs (including parameter arguments).
317 thousand hosts in the queue.
2.4 million websites stored locally.
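For anyone curious how counts like these can be produced, here is a minimal sketch in Python. It is not the crawler's actual code: the `normalize` and `queue_stats` helpers are hypothetical names, and the choice to keep the query string while dropping the fragment is an assumption that matches "distinct URIs (including parameter arguments)".

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Lowercase the scheme and host, drop the fragment, keep the query string."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",   # treat an empty path as the site root
        parts.query,         # parameter arguments stay part of the URI
        "",                  # fragments never reach the server, so drop them
    ))


def queue_stats(urls):
    """Count queued pages, distinct normalized URIs, and distinct hosts."""
    distinct = {normalize(u) for u in urls}
    hosts = {urlsplit(u).hostname for u in distinct}
    return {
        "pages": len(urls),
        "distinct_uris": len(distinct),
        "hosts": len(hosts),
    }
```

With a real queue of tens of millions of entries the sets would not fit comfortably in memory, so a database with a unique index on the normalized URI would probably be the more practical route.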
In all of this data I have:
Discovered that 10.7 thousand of the sites are adult-based or mention adult-based content.
Discovered that 1.6 thousand are using the HTML5 doctype.
Discovered that one must not crawl Twitter, because it alone makes up 11.3% of the data stored on the server.
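Two of the findings above are easy to sketch in Python. This is only an illustration under stated assumptions, not the code behind the numbers: `is_html5` and `host_share` are hypothetical helpers, the doctype check only recognizes the short `<!DOCTYPE html>` form at the very start of the page, and `host_share` assumes each stored page is available as a `(url, size_in_bytes)` pair.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches only the short HTML5 doctype; older doctypes carry a PUBLIC identifier.
HTML5_DOCTYPE = re.compile(r"\s*<!doctype\s+html\s*>", re.IGNORECASE)


def is_html5(page: str) -> bool:
    """Return True if the page starts with the short HTML5 doctype."""
    return HTML5_DOCTYPE.match(page) is not None


def host_share(pages):
    """Return each host's share of stored bytes, as a percentage.

    pages: iterable of (url, size_in_bytes) pairs.
    """
    sizes = Counter()
    for url, size in pages:
        sizes[urlsplit(url).hostname] += size
    total = sum(sizes.values())
    if total == 0:
        return {}
    return {host: 100.0 * size / total for host, size in sizes.items()}
```

A per-host report like this makes it obvious when one site is eating the disk, which is a good moment to add a per-host page cap to the crawler.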
This is what I come home to: