Something-driven development

Software development thoughts around Ubuntu, Python, Golang and other tools

A tour of Go – the web crawler exercise


The last exercise in the Go Tour – parallelizing a web crawler – turned out to be quite a bit more interesting than I’d expected. If anyone has suggested improvements from which I can learn a bit more, or has posted their own solution, let me know – my exercise solution is on github. I’ve tried to stick to the tour content (i.e. only using channels, rather than the sync package, to access shared data).

Spoiler Alert: If you are learning Golang and haven’t yet worked through the Go Tour, go and do so now. If you get stuck, keep struggling, take a break, and try again in a few days before looking at other people’s solutions.

The solution I ended up with has a Crawl() function very similar to the original, just with two extra parameters:

func Crawl(url string, depth int, fetcher Fetcher,
	startCrawl func(string) bool, crawlComplete chan string) {

	if depth <= 0 {
		// Signal completion even though nothing was fetched.
		crawlComplete <- url
		return
	}

	body, urls, err := fetcher.Fetch(url)
	if err != nil {
		fmt.Println(err)
		// Signal completion for failed fetches too.
		crawlComplete <- url
		return
	}

	fmt.Printf("found: %s %q\n", url, body)
	for _, u := range urls {
		if startCrawl(u) {
			go Crawl(u, depth-1, fetcher, startCrawl, crawlComplete)
		}
	}
	// The page is fetched and all child crawls have been spawned.
	crawlComplete <- url
}

The two parameters are:

  • startCrawl func(url string) bool – called before spawning a new ‘go Crawl(url)’ to ensure that we don’t crawl the same url twice.
  • crawlComplete chan string – used to signal that the Crawl function has fetched its page and finished spawning any child goroutines.

These two resources are created in the main() function and passed to the initial Crawl() call:

func main() {
	startCrawl := make(chan StartCrawlData)
	crawlComplete := make(chan string)
	quitNow := make(chan bool)
	go processCrawls(startCrawl, crawlComplete, quitNow)

	// Returns whether a crawl should be started for a given
	// URL.
	startCrawlFn := func(url string) bool {
		resultChan := make(chan bool)
		startCrawl <- StartCrawlData{url, resultChan}
		return <-resultChan
	}

	Crawl("http://golang.org/", 4, fetcher, startCrawlFn,
		crawlComplete)

	<-quitNow
}
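
The StartCrawlData type isn’t shown above – it just pairs the url being asked about with a channel for the reply. A minimal sketch (see the repository for the exact definition):

// StartCrawlData asks processCrawls whether a url should be
// crawled, with the answer returned on resultChan.
type StartCrawlData struct {
	url        string
	resultChan chan bool
}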

Access to the shared state – which urls have been crawled, and whether all the Crawl() calls have finished – is managed via those channels in the processCrawls() goroutine, so that main() can simply make the first Crawl() call and then wait to quit. I still want to check how cheap the temporary creation of a channel is (for the return value of startCrawlFn above – I think I saw this technique in an earlier Golang tutorial example), but otherwise I’m happy with the solution :-).
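
Sketched below is roughly what processCrawls() does – a single goroutine owning the urlsCrawled map and a count of outstanding Crawl() calls, so no locking is needed. This is a simplified illustration of the idea; the exact code is in the repository:

func processCrawls(startCrawl chan StartCrawlData,
	crawlComplete chan string, quitNow chan bool) {

	// Seed the state with the initial url crawled by main().
	urlsCrawled := map[string]bool{"http://golang.org/": true}
	outstanding := 1

	for {
		select {
		case data := <-startCrawl:
			if urlsCrawled[data.url] {
				data.resultChan <- false
			} else {
				// Record the url and count the child Crawl()
				// that the caller is about to spawn.
				urlsCrawled[data.url] = true
				outstanding++
				data.resultChan <- true
			}
		case <-crawlComplete:
			outstanding--
			if outstanding == 0 {
				quitNow <- true
				return
			}
		}
	}
}

Because startCrawlFn blocks until the url has been recorded and the count bumped, outstanding can only drop to zero once every spawned Crawl() has signalled on crawlComplete.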

Other solutions to learn from:

Written by Michael

November 14, 2012 at 12:32 pm

Posted in golang

5 Responses


  1. Hi Mick,
    Here’s my solution for your ‘other solutions’ section. It’s different to both yours and Sonia’s in that I don’t instantiate a single channel. Instead, I use types from the sync package to do all the heavy lifting for me (yes, I know they create channels).

    http://grandiloquentmusings.blogspot.com/2013/12/my-solution-to-go-tutorial-web-crawler.html

    Kim Mason

    December 30, 2013 at 1:08 am

    • Nice – highlighting your changes in green helps to show how little you needed to add.

      Michael

      December 30, 2013 at 10:46 am

  2. It’s much easier if you drop the recursion (see the sketch after these comments).
    http://cdunn2001.blogspot.com/2014/08/recursive-memoization-with-go.html

    Christopher Dunn

    August 15, 2014 at 1:40 pm

  3. Silly question: Is it possible in your solution that allChecked (on line 72) will be true even when the max depth for each page has not yet been reached and there are outstanding URLs to be crawled? Could URLs a, b, and c all be in urlsCrawled, with the signal for URL c coming in on the crawlCompleteReceiver so that allChecked becomes true, while a URL d is just at line 91 and hasn’t yet been read into the startCrawlReceiver channel, so it isn’t in the urlsCrawled map?

    If this isn’t possible, could you please explain to a Go noob why not?

    Rebecca

    October 8, 2015 at 10:59 pm
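
As a footnote to Christopher’s comment above, here is a rough sketch of a recursion-free version – a single loop that owns the seen map and counts in-flight fetches. This is my illustration of the idea rather than his code, and it assumes the Fetcher interface from the tour:

func CrawlIterative(start string, depth int, fetcher Fetcher) {
	if depth <= 0 {
		return
	}

	type fetchResult struct {
		urls  []string
		depth int
	}
	results := make(chan fetchResult)

	// fetch runs in its own goroutine and reports back on results.
	fetch := func(url string, depth int) {
		body, urls, err := fetcher.Fetch(url)
		if err != nil {
			fmt.Println(err)
			results <- fetchResult{nil, depth}
			return
		}
		fmt.Printf("found: %s %q\n", url, body)
		results <- fetchResult{urls, depth}
	}

	seen := map[string]bool{start: true}
	inFlight := 1
	go fetch(start, depth)

	// This loop is the only goroutine touching seen and inFlight,
	// so no locking is needed.
	for inFlight > 0 {
		r := <-results
		inFlight--
		if r.depth <= 1 {
			continue // Following links here would exceed the depth.
		}
		for _, u := range r.urls {
			if !seen[u] {
				seen[u] = true
				inFlight++
				go fetch(u, r.depth-1)
			}
		}
	}
}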

