Something-driven development

Software development thoughts around Ubuntu, Python, Golang and other tools

A tour of Go – the web crawler exercise

with 2 comments

The last exercise in the Go Tour – parallelizing a web crawler – turned out to be quite a bit more interesting than I’d expected. If anyone has suggested improvements from which I can learn a bit more, or their own solutions posted, let me know – my exercise solution is on github. I’ve tried to stick to the tour content (i.e. only using channels, rather than the sync package, for accessing shared data).

Spoiler Alert: If you are learning Golang and haven’t yet worked through the Go Tour, go and do so now. If you get stuck, keep struggling, take a break, try again in a few days, etc., before looking at other people’s solutions.

The solution I ended up with has a Crawl() function very similar to the original, just with two extra function parameters:

func Crawl(url string, depth int, fetcher Fetcher,
	startCrawl func(string) bool, crawlComplete chan string) {

	if depth <= 0 {
		crawlComplete <- url
		return
	}

	body, urls, err := fetcher.Fetch(url)
	if err != nil {
		fmt.Println(err)
		crawlComplete <- url
		return
	}

	fmt.Printf("found: %s %q\n", url, body)
	for _, u := range urls {
		if startCrawl(u) {
			go Crawl(u, depth-1, fetcher, startCrawl, crawlComplete)
		}
	}
	crawlComplete <- url
}
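For reference, the fetcher passed in implements the Fetcher interface provided by the tour’s exercise code:

type Fetcher interface {
	// Fetch returns the body of URL and
	// a slice of URLs found on that page.
	Fetch(url string) (body string, urls []string, err error)
}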

The two parameters are:

  • startCrawl func(url string) bool – used as a check before spawning a new ‘go Crawl(url)’ to ensure that we don’t crawl the same url twice.
  • crawlComplete chan string – used to signal that the Crawl function has fetched the page and finished spawning any child go-routines.

These two resources are created and passed into the initial Crawl() call in the main() function:

func main() {
	startCrawl := make(chan StartCrawlData)
	crawlComplete := make(chan string)
	quitNow := make(chan bool)
	go processCrawls(startCrawl, crawlComplete, quitNow)

	// Returns whether a crawl should be started for a given
	// URL.
	startCrawlFn := func(url string) bool {
		resultChan := make(chan bool)
		startCrawl <- StartCrawlData{url, resultChan}
		return <-resultChan
	}

	Crawl("http://golang.org/", 4, fetcher, startCrawlFn,
		crawlComplete)

	<-quitNow
}

Access to the shared state – which urls have been crawled, whether all the Crawls() have finished, and so on – is managed via those channels in the processCrawls() go-routine, so that main() can simply call the first Crawl() and then wait to quit. I still want to check how cheap the temporary creation of a channel is (for the return value of startCrawlFn above) – I think I saw this technique in an earlier Golang tutorial example – but otherwise I’m happy with the solution :-).
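The processCrawls() go-routine itself isn’t shown above – the exact version is on github – but a minimal sketch of the idea (not necessarily line-for-line what’s in the repo; the StartCrawlData field names here are illustrative) is a single go-routine that owns the set of seen urls and a count of in-flight Crawls:

type StartCrawlData struct {
	url        string
	resultChan chan bool
}

func processCrawls(startCrawl chan StartCrawlData,
	crawlComplete chan string, quitNow chan bool) {

	// Seed with the root url, which main() crawls directly.
	urlsCrawled := map[string]bool{"http://golang.org/": true}
	crawlsInProgress := 1 // the initial Crawl() in main()

	for {
		select {
		case data := <-startCrawl:
			// Approve the crawl only if the url is new.
			if urlsCrawled[data.url] {
				data.resultChan <- false
			} else {
				urlsCrawled[data.url] = true
				crawlsInProgress++
				data.resultChan <- true
			}
		case <-crawlComplete:
			crawlsInProgress--
			if crawlsInProgress == 0 {
				quitNow <- true
				return
			}
		}
	}
}

Because startCrawlFn blocks on its resultChan, every approved child is counted before its parent sends on crawlComplete, so crawlsInProgress can only reach zero once the whole crawl tree has finished.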
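As for how cheap that temporary resultChan is, a quick (if unscientific) way to check would be a micro-benchmark with the standard testing package:

// channel_bench_test.go – run with: go test -bench=.
package main

import "testing"

func BenchmarkMakeChan(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = make(chan bool) // allocate an unbuffered channel each iteration
	}
}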

Other solutions to learn from:


Written by Michael

November 14, 2012 at 12:32 pm

Posted in golang

2 Responses


  1. Hi Mick,
    Here’s my solution for your ‘other solutions’ section. It’s different to both yours and Sonia’s in that I don’t instantiate a single channel. Instead, I use types from the sync package to do all the heavy lifting for me (yes, I know they create channels).

    http://grandiloquentmusings.blogspot.com/2013/12/my-solution-to-go-tutorial-web-crawler.html

    Kim Mason

    December 30, 2013 at 1:08 am

    • Nice – highlighting your changes in green helps to show how little you needed to add.

      Michael

      December 30, 2013 at 10:46 am

