A tour of Go – the web crawler exercise
The last exercise in the Go Tour – parallelizing a web crawler – turned out to be quite a bit more interesting than I’d expected. If anyone has suggested improvements from which I can learn a bit more, or has posted their own solution, let me know – my exercise solution is on GitHub. I’ve tried to stick to the tour content (i.e. only using channels, rather than the sync package, for accessing shared data).
Spoiler Alert: If you are learning Go and haven’t yet worked through the Go Tour, go and do so now. If you get stuck, keep struggling, take a break, and try again in a few days before looking at other people’s solutions.
The solution I ended up with has a Crawl() function very similar to the original, just with two extra function parameters:
```go
func Crawl(url string, depth int, fetcher Fetcher, startCrawl func(string) bool, crawlComplete chan string) {
	if depth <= 0 {
		crawlComplete <- url
		return
	}
	body, urls, err := fetcher.Fetch(url)
	if err != nil {
		fmt.Println(err)
		crawlComplete <- url
		return
	}
	fmt.Printf("found: %s %q\n", url, body)
	for _, u := range urls {
		if startCrawl(u) {
			go Crawl(u, depth-1, fetcher, startCrawl, crawlComplete)
		}
	}
	crawlComplete <- url
}
```
The two parameters are:
- startCrawl func(url string) bool – checked before spawning a new ‘go Crawl(url)’, to ensure that we don’t crawl the same URL twice.
- crawlComplete chan string – used to signal that the Crawl function has fetched the page and finished spawning any child goroutines.
These two resources are created and passed in to the initial Crawl() call in the main() function:
```go
func main() {
	startCrawl := make(chan StartCrawlData)
	crawlComplete := make(chan string)
	quitNow := make(chan bool)

	go processCrawls(startCrawl, crawlComplete, quitNow)

	// Returns whether a crawl should be started for a given URL.
	startCrawlFn := func(url string) bool {
		resultChan := make(chan bool)
		startCrawl <- StartCrawlData{url, resultChan}
		return <-resultChan
	}

	Crawl("http://golang.org/", 4, fetcher, startCrawlFn, crawlComplete)
	<-quitNow
}
```
Access to the shared state – which URLs have been crawled, whether all the Crawl() calls have finished, and so on – is managed via those channels in the processCrawls() goroutine, so that main() can simply make the first Crawl() call and then wait to quit. I still want to check how cheap it is to create a temporary channel (for the return value of startCrawlFn above) – I think I saw this technique in an earlier Go tutorial example – but otherwise I’m happy with the solution :-).
Other solutions to learn from:
- Sonia Codes’ solution – sharing memory by communicating
- Kim Mason’s solution using sync package
- [I’ll add more as I find them, now that I’m finished my own solution :-)]
Hi Mick,
Here’s my solution for your ‘other solutions’ section. It’s different to both yours and Sonia’s in that I don’t instantiate a single channel. Instead, I use types from the sync package to do all the heavy lifting for me (yes, I know they create channels).
http://grandiloquentmusings.blogspot.com/2013/12/my-solution-to-go-tutorial-web-crawler.html
Kim Mason
December 30, 2013 at 1:08 am
Nice – highlighting your changes in green helps to see how little you needed to add.
Michael
December 30, 2013 at 10:46 am
It’s much easier if you drop the recursion.
http://cdunn2001.blogspot.com/2014/08/recursive-memoization-with-go.html
Christopher Dunn
August 15, 2014 at 1:40 pm
Silly question: Is it possible in your solution that allChecked (on line 72) will be true even when the max depth for each page has not yet been reached, and there are outstanding URLs to be crawled? Could URLs a, b, and c all be in urlsCrawled, and the signal for URL c comes in on the crawlCompleteReceiver so allChecked is true, but there is actually a URL d that is just at line 91 and we haven’t yet read it into the startCrawlReceiver channel so it isn’t in the urlsCrawled map?
If this isn’t possible, could you please explain to a Go noob why not?
Rebecca
October 8, 2015 at 10:59 pm
Hi Rebecca. Not a silly question and you’re absolutely correct. Arnaud brought up the same point as a github issue [1] – but I’ve not taken the time to get back and have another look.
[1] https://github.com/absoludity/go-tour-exercises/issues/1
Michael
October 9, 2015 at 4:34 am