Radar: Building a High Performance Crawler in Go
For our use case, the program must meet certain performance requirements to be valuable: it must crawl every zone record in less than a month. In November 2020, the .com zone contained over 350 million domains. It is likely the zone with the largest number of domains and is therefore well suited for calculating the minimum throughput we are aiming for.
The shortest month has 28 days = 2,419,200 seconds. 350,000,000 requests / 2,419,200 s ≈ 145 requests/s.
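The same arithmetic can be written as a small Go program. This is only a sketch to make the calculation reproducible; `requiredRate` is a hypothetical helper name and not part of the crawler itself.

```go
package main

import (
	"fmt"
	"math"
)

// requiredRate returns the average number of requests per second needed to
// request `domains` domains once within `days` days.
// (Hypothetical helper, for illustration only.)
func requiredRate(domains, days int) float64 {
	seconds := float64(days) * 24 * 60 * 60
	return float64(domains) / seconds
}

func main() {
	rate := requiredRate(350_000_000, 28)
	// Round up: any rate below this would not finish within the month.
	fmt.Printf("%.0f requests/s\n", math.Ceil(rate)) // prints "145 requests/s"
}
```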
client := &http.Client{}
Note to me: See https://golang.org/pkg/crypto/tls/#example_Config_verifyConnection for further optimisations.
There are many things that can go wrong when making HTTP requests. A major part of
x509: certificate is valid for *.parkingcrew.net, parkingcrew.net, not 0--00--0.com
The received certificate is not valid for the requested domain. This is irrelevant for our use case: we want to request the page anyway. This can be achieved by using a custom HTTP transport and [disabling certificate verification](https://stackoverflow.com/a/46011355).
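// Clone the default transport so that its other settings are kept.
customTransport := http.DefaultTransport.(*http.Transport).Clone()
// Skip certificate verification. This disables protection against man-in-the-middle attacks.
customTransport.TLSClientConfig = &tls.Config{InsecureSkipVerify: true}
client := &http.Client{Transport: customTransport}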
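To see the effect of this setting in isolation, the following sketch compares a default client with the custom one against a local test server. It rests on several assumptions: the server comes from `net/http/httptest` and presents a certificate that is not trusted (rather than one issued for a different host name, as in the error above), and `newInsecureClient` and `compareClients` are hypothetical helper names.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newInsecureClient returns a client that skips TLS certificate verification.
// (Hypothetical helper wrapping the transport setup shown above.)
func newInsecureClient() *http.Client {
	transport := http.DefaultTransport.(*http.Transport).Clone()
	transport.TLSClientConfig = &tls.Config{InsecureSkipVerify: true}
	return &http.Client{Transport: transport}
}

// compareClients sends a HEAD request to a local TLS test server, first with a
// default client and then with the insecure client. It reports whether the
// default client failed and which status code the insecure client received.
func compareClients() (bool, int, error) {
	server := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer server.Close()

	// The default client verifies the certificate and is expected to fail.
	strictFailed := false
	if resp, err := (&http.Client{}).Head(server.URL); err != nil {
		strictFailed = true
	} else {
		resp.Body.Close()
	}

	// The insecure client ignores the certificate problem.
	resp, err := newInsecureClient().Head(server.URL)
	if err != nil {
		return strictFailed, 0, err
	}
	defer resp.Body.Close()
	return strictFailed, resp.StatusCode, nil
}

func main() {
	strictFailed, status, err := compareClients()
	fmt.Println("default client failed:", strictFailed)
	fmt.Println("insecure client status:", status, "error:", err)
}
```

Skipping verification is only reasonable here because the crawler reads public pages; the same client should not be reused for requests that carry credentials.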
Some responses have a content length of -1. According to the documentation of the `ContentLength` field, "The value -1 indicates that the length is unknown." Opening a URL with this attribute might still return an HTML response with content, so we cannot rely on this field.
I'm not sure yet what causes the content length to be unknown. If I knew, I might be able to prevent it.
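One way to avoid depending on the field is to read the body and count the bytes. The sketch below assumes a GET request and a local `net/http/httptest` server that flushes its response early, which is one known way to end up with an unknown length (chunked transfer encoding); whether that is what happens for the crawled domains is not established here. `bodySize` and `newChunkedServer` are hypothetical helper names, and the 1 MiB read limit is an arbitrary choice.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// bodySize requests url and returns the declared content length together with
// the number of body bytes actually read (at most limit bytes).
func bodySize(client *http.Client, url string, limit int64) (int64, int64, error) {
	resp, err := client.Get(url)
	if err != nil {
		return 0, 0, err
	}
	defer resp.Body.Close()

	// Count the bytes instead of trusting resp.ContentLength.
	n, err := io.Copy(io.Discard, io.LimitReader(resp.Body, limit))
	return resp.ContentLength, n, err
}

// newChunkedServer starts a test server that flushes before it has written
// the whole body, so it cannot send a Content-Length header.
func newChunkedServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<html>")
		if f, ok := w.(http.Flusher); ok {
			f.Flush()
		}
		fmt.Fprint(w, "hello</html>")
	}))
}

func main() {
	server := newChunkedServer()
	defer server.Close()

	declared, actual, err := bodySize(server.Client(), server.URL, 1<<20)
	fmt.Println("declared:", declared, "actual:", actual, "error:", err)
}
```

Reading the body costs more time and bandwidth than a HEAD request, so this would be a trade-off against the throughput goal.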
2020/11/27 09:38:23 Head "https://0--7.com": dial tcp 75.126.102.252:443: i/o timeout
2020/11/27 09:38:53 Head "https://0--7.com": dial tcp 75.126.102.252:443: i/o timeout
2020/11/27 09:39:23 Head "https://0--7.com": dial tcp 75.126.102.252:443: i/o timeout
A look at the logged errors reveals that each connection attempt times out after 30 seconds, which is the dial timeout of Go's default HTTP transport. That's too long and results in a benchmarked time of around 540 seconds for the first 100 domains in our list. Based on our initial calculation, a full round trip can take at most 1000 ms / 145 requests ≈ 7 ms/request if requests are made one after another.
Configuring the client to time out after 500 ms reduces this to 30 s for 100 requests. More fine-grained timeouts for the individual phases of the HTTP request lifecycle can also be configured.
client := &http.Client{
	Transport: customTransport,
	Timeout:   500 * time.Millisecond,
}
Keep in mind that these benchmarks are of very low quality, but they are still useful for communicating the large impact of small code changes.
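The fine-grained timeout configuration mentioned above could look roughly like the following sketch. The individual values (200 ms for dialing and for the TLS handshake, 300 ms for the response headers) are assumptions chosen for illustration and have not been benchmarked; `newCrawlerClient` is a hypothetical helper name. The overall `Timeout` of 500 ms is kept as an upper bound for the whole request.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"net/http"
	"time"
)

// newCrawlerClient returns a client with separate limits for the individual
// phases of a request. All per-phase values are illustrative assumptions.
func newCrawlerClient() *http.Client {
	transport := http.DefaultTransport.(*http.Transport).Clone()
	transport.TLSClientConfig = &tls.Config{InsecureSkipVerify: true}

	// Limit for establishing the TCP connection (30 s in the default transport).
	transport.DialContext = (&net.Dialer{Timeout: 200 * time.Millisecond}).DialContext
	// Limit for the TLS handshake.
	transport.TLSHandshakeTimeout = 200 * time.Millisecond
	// Limit for waiting on the response headers after the request was sent.
	transport.ResponseHeaderTimeout = 300 * time.Millisecond

	return &http.Client{
		Transport: transport,
		// Upper bound for the whole request, including reading the body.
		Timeout: 500 * time.Millisecond,
	}
}

func main() {
	client := newCrawlerClient()
	fmt.Println("overall timeout:", client.Timeout) // prints "overall timeout: 500ms"
}
```

Per-phase limits make it easier to see in the logs which phase is slow, while the overall timeout still caps the total time spent per domain.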