Python URL grabber!

April 27th, 2009
by Serinox

I’m experimenting with building a web crawler in Python (maybe even a small search engine bot), and I thought I would post what I have so far. Basically, you give it a starting URL (in the PendingURL list) and it fetches that page, uses regular expressions to find the HTML links (some of the expressions might not be the best ones to use, but they work for now), then extracts the URL from each link and adds it to the PendingURL list. The provided sample stops once it has checked 10 URLs, but that’s very easy to change.
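
To make the two-stage idea concrete (find the anchor tags first, then pull a URL out of each one), here is a minimal sketch that runs on an in-memory HTML string instead of a live page. The patterns ANCHOR_RE and URL_RE and the function extract_urls are simplified stand-ins made up for this example, not the expressions the script uses, and it is written for Python 3.

```python
import re

# Stage 1: find anchor tags that carry an href attribute.
# Stage 2: pull an absolute http(s) URL out of each tag.
# Both patterns are simplified stand-ins for illustration only.
ANCHOR_RE = re.compile(r"<a\b[^>]*\bhref\b[^>]*>", re.IGNORECASE | re.DOTALL)
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def extract_urls(html):
    """Return absolute URLs found in anchor tags, in order, without duplicates."""
    found = []
    for tag in ANCHOR_RE.findall(html):
        match = URL_RE.search(tag)
        if match and match.group(0) not in found:
            found.append(match.group(0))
    return found


sample = (
    '<p><a href="http://example.com/a">A</a> '
    '<A HREF="https://example.org/b?x=1">B</A> '
    '<a href="/relative">skipped</a> '
    '<a href="http://example.com/a">again</a></p>'
)
print(extract_urls(sample))
```

Relative links such as /relative are skipped here because the URL pattern only matches addresses that start with http or https.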

Currently, if there is an issue with a link (e.g. it cannot connect or can’t find the server), it just ignores it and moves on. I’ll probably change that later on. But for right now it’s a nifty tool that can grab quite a few URLs off of just a few pages. When it’s done running, it writes all of the links to a text file (urls.txt).
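
One possible way to stop ignoring bad links is to record why each one failed and keep going. The sketch below is only an illustration of that idea, not what the script does: crawl, fake_fetch_links and FAKE_SITE are hypothetical names, and the "site" is a small dictionary so the example runs on Python 3 without touching the network.

```python
def crawl(start_urls, fetch_links, limit=10):
    """Visit up to `limit` URLs and return (visited, failed).

    `fetch_links` is any callable that takes a URL and returns the list of
    URLs found on that page, raising an exception if the page can't be loaded.
    """
    pending = list(start_urls)
    visited = []
    failed = {}  # url -> error message
    while pending and len(visited) + len(failed) < limit:
        url = pending.pop(0)
        try:
            links = fetch_links(url)
        except Exception as exc:
            # Remember the failure instead of silently dropping it.
            failed[url] = str(exc)
            continue
        visited.append(url)
        for link in links:
            if link not in visited and link not in pending and link not in failed:
                pending.append(link)
    return visited, failed


# Hypothetical in-memory "site": each page maps to the links it contains.
FAKE_SITE = {
    "http://a.example/": ["http://b.example/", "http://down.example/"],
    "http://b.example/": ["http://a.example/"],
}


def fake_fetch_links(url):
    # Stand-in for a real HTTP fetch; unknown URLs behave like dead servers.
    if url not in FAKE_SITE:
        raise IOError("cannot connect to " + url)
    return FAKE_SITE[url]


visited, failed = crawl(["http://a.example/"], fake_fetch_links)
print(visited)
print(failed)
```

Passing the fetcher in as an argument keeps the loop easy to try out; a real fetcher built on httplib2 could be swapped in for fake_fetch_links.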

This code requires the httplib2 library (a third-party package) and the re (regular expression) module from the standard library.

import httplib2
import re

# URLs that have already been visited
URLList = []
# URLs waiting to be visited; put your starting URL here
PendingURL = ["STARTING URL"]

RegLinks = "((<a).*?(href).*?(>).*?())"
RegURL = '.*?((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s"]*))'
RegLinks = re.compile(RegLinks, re.IGNORECASE|re.DOTALL|re.MULTILINE)
RegURL = re.compile(RegURL, re.IGNORECASE|re.DOTALL|re.MULTILINE)

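def LoadHttp (url):
    # Fetch one page and return every anchor tag that RegLinks finds in it.
    global URLList
    global PendingURL
    PendingURL.remove(url)
    H = httplib2.Http(".cache")
    resp,  content  = H.request(url, "GET")
    URLList.append(url)
    # On Python 3, httplib2 returns the body as bytes; decode it so the str patterns can match.
    if not isinstance(content, str):
        content = content.decode("utf-8", "replace")
    return RegLinks.findall(content)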

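def GetUrls (Result):
    # Queue the first URL from each anchor tag unless it is already queued or visited.
    global PendingURL
    for R in Result:
        URL = RegURL.findall(R[0])
        if URL != []:
            # Compare the URL string itself (URL is a list, so it would never match).
            if URL[0] not in PendingURL and URL[0] not in URLList:
                PendingURL.append(URL[0])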

# Stop after 10 pages, or sooner if there is nothing left to visit.
while (len(URLList) < 10 and PendingURL):
    try:
        GetUrls(LoadHttp(PendingURL[0]))
    except Exception:
        print("Something broke, ignoring it 'cause I'm lazy")

# Add the URLs that were found but never visited.
for A in PendingURL:
    if A not in URLList:
        URLList.append(A)

# Write one URL per line; text mode handles the platform's line endings.
with open("urls.txt", "w") as out:
    for U in URLList:
        out.write(U + "\n")
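
Since the regular expressions above might not be the best tool for picking links out of HTML, here is a sketch of a different technique: letting the standard library's HTML parser find the href attributes and using urljoin to resolve relative links. This swaps out the regex approach entirely and assumes Python 3; the LinkCollector class and the example.com addresses are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from the anchor tags fed to the parser."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they came from.
                absolute = urljoin(self.base_url, value)
                if absolute.startswith(("http://", "https://")) and absolute not in self.links:
                    self.links.append(absolute)


collector = LinkCollector("http://example.com/dir/page.html")
collector.feed(
    '<a href="other.html">x</a> '
    '<A HREF="https://example.org/">y</A> '
    '<a href="mailto:me@example.com">z</a> '
    '<a name="top">t</a>'
)
print(collector.links)
```

Unlike the regex version, this also picks up relative links such as other.html, and it drops non-web schemes like mailto.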
