Making a google scraper to track SEO rankings

Picture this: you are creator of the great base64decode.org. and you want to setup an alert system to see if anyone is trying to steal your #1 spot on google. This guide will show you how to make a scraper to monitor search rankings, complete with JS challenge handling.

The finished example is available here

Starting the project

Create a new automation in Futura

From the dashboard, simply enable developer mode and click the plus button to open the automation creation prompt.

Follow the onboarding to create our module

For this guide, we will only have 1 module, which we will call seomonitor

Start the development environment

In order to quickly iterate in our development process, we will use the live development feature.

To do this, we will first need to login to the dev branch with:

ftr login -a automation_id -b dev

then simply run:

ftr watch

Now our boilerplate code should be deployed and the ftr cli should be waiting for more changes to deploy

Prepare our parameters

Define the module's parameters

First off lets define what parameters this module will need to accomplish its task.

This module will monitor a google search page for ranking changes. This means we will need a search term.

At first thought, a string parameter would work for this. However, for monitoring multiple search terms it would be more convenient to be able to input a list of terms, which can be accomplished by creating and using a basic group.

Create the "search term" basic group type

Basic groups are just implementations of the basicgroupsprotocol.Parsable interface, so lets make type that implements this in searchTermGroup.go.

A "search term" is just a string, so implementing the parsing/serialization is simple:

package seoscraper

import basicgroupsprotocol "github.com/futura-platform/protocol/basicgroups/protocol"

type searchTerm string

// Equals implements basicgroupsprotocol.Parsable.
func (s searchTerm) Equals(s2 searchTerm) bool {
	return s == s2
}

// GetGroupConfig implements basicgroupsprotocol.Parsable.
func (searchTerm) GetGroupConfig() basicgroupsprotocol.GroupConfig {
	return basicgroupsprotocol.GroupConfig{
		EntryTypeSingular: "Search Term",
		EntryTypePlural:   "Search Terms",
		EntryPlaceholder:  "Enter a search term",
		Icon: `<svg xmlns="http://www.w3.org/2000/svg" fill="none" viewBox="0 0 24 24" stroke-width="1.5" stroke="currentColor" class="size-6">
  <path stroke-linecap="round" stroke-linejoin="round" d="m15.75 15.75-2.489-2.489m0 0a3.375 3.375 0 1 0-4.773-4.773 3.375 3.375 0 0 0 4.774 4.774ZM21 12a9 9 0 1 1-18 0 9 9 0 0 1 18 0Z" />
</svg>
`,
	}
}

// ParseEntry implements basicgroupsprotocol.Parsable.
func (searchTerm) ParseEntry(s string) (searchTerm, error) {
	return searchTerm(s), nil
}

// SerializeEntry implements basicgroupsprotocol.Parsable.
func (s searchTerm) SerializeEntry() string {
	return string(s)
}

Accept this basic group as a parameter

Now to accept entries from this basic group as a parameter, we will use the basicgroupsprotocol.EntryProvided[T] type, like so:

type Params struct {
	SearchTerm basicgroupsprotocol.EntryProvided[SearchTerm]
}

Now we have a form input for a search term group:

And a place to manage search term groups:

Implement the steps

Create step 1: Initializing the session

Google has recently stepped up their anti scraping game, likely in response to the rise in LLM adjacent scraping.

This means that in order to make a search, you need to solve a JS challenge first.

Reverse engineering this challenge is one option, but we don't need that kind of performance for our use case, so lets sandbox it with a browser.

Luckily, Futura has a utility for exactly this!

We will make the browser navigate to a random search page, and the JS challenge will be solved automatically.

Once it has navigated, the valid session cookies can be exported for use in our much more performant net client, and the browser can be closed.

package seoscraper

import (
	"context"
	"fmt"
	"math/rand/v2"
	"net/http"
	"net/url"
	"time"

	"github.com/chromedp/cdproto/network"
	"github.com/chromedp/chromedp"
	"github.com/futura-platform/protocol/flowprotocol"
)

func (t *Task) InitializeSession() flowprotocol.TaskStepResult {
	b, cancel, err := t.SpawnSingleTabBrowser(t, t.GetProxy())
	if err != nil {
		return t.ReturnFatalErrorf("failed to spawn browser: %w", err)
	}
	defer cancel()

	u, _ := url.Parse("https://www.google.com/search?" + url.Values{
		"q": []string{fmt.Sprint(rand.Int())},
	}.Encode())

	// retrieve the session cookies
	var cookies []*network.Cookie
	err = chromedp.Run(b.CTX,
		chromedp.Navigate(u.String()),
		chromedp.ActionFunc(func(ctx context.Context) error {
			var err error
			cookies, err = network.GetCookies().Do(ctx)
			return err
		}),
	)
	if err != nil {
		return t.ReturnSmallErrorf("failed to navigate to Google: %w", err)
	}

	httpCookies := make([]*http.Cookie, len(cookies))
	for i, cookie := range cookies {
		httpCookies[i] = &http.Cookie{
			Name:    cookie.Name,
			Value:   cookie.Value,
			Domain:  cookie.Domain,
			Path:    cookie.Path,
			Expires: time.Unix(int64(cookie.Expires), 0),
		}
	}
	t.GetCookieJar().SetCookies(u, httpCookies)

	return t.ReturnBasicStepSuccess()
}

And lets make sure to update our constructor to use this function:

func ConstructTask(base *protocol.Task[Params]) (protocol.BaseTask, []flowprotocol.TaskStep, error) {
	t := &Task{
		Task: base,
	}

	return t,
		[]flowprotocol.TaskStep{
			{StepFunc: t.InitializeSession},
		},
		nil
}

Create step 2: fetch the search page

Now that we have a session prepared, we need to fetch the search page in order to have data to report on.

First, lets add a field to the Task struct to store the search rankings in a slice of *url.URLs. We will call it topLevelSearchResults because we only want the top level results, no nested links like this:

Our task struct will now look like this:

type Task struct {
	*protocol.Task[Params]

	topLevelSearchResults []*url.URL
}

Now lets implement the step for this.

Lets start with a basic GET http request to have the response available:

resp, err := t.Get("https://www.google.com/search?"+url.Values{
	"q": []string{string(*t.Params.SearchTerm)},
}.Encode(), getHeaders())
if err != nil {
	return t.ReturnSmallErrorf("failed to make request: %w", err)
}
defer resp.Body.Close()
if resp.StatusCode != 200 {
	return t.ReturnFatalErrorf("bad status code: %d", resp.StatusCode)
}

Then lets parse the response, possibly handling for another JS challenge:

// parse the response body with goquery
doc, err := goquery.NewDocumentFromReader(resp.Body)
if err != nil {
	return t.ReturnSmallErrorf("failed to parse response body: %w", err)
}

// detect if we received the JS challenge page, if so then retry initialization
if doc.Find("script").FilterFunction(func(i int, s *goquery.Selection) bool {
	return strings.Contains(s.Text(), `SG_SS=`)
}).Length() > 0 {
	r := t.ReturnSmallErrorf("encountered JS challenge page")
	r.NextStepLabel = "InitializeSession"
	return r
}

// find the top-level search result anchors
resultAnchors := doc.Find("a[href]").FilterFunction(func(i int, s *goquery.Selection) bool {
	first := s.Children().First()
	return goquery.NodeName(first) == "h3"
})
t.topLevelSearchResults = make([]*url.URL, resultAnchors.Length())
for i, a := range resultAnchors.EachIter() {
	href, exists := a.Attr("href")
	if !exists {
		return t.ReturnSmallErrorf("search result anchor does not have href attribute")
	}
	u, err := url.Parse(href)
	if err != nil {
		return t.ReturnSmallErrorf("failed to parse search result URL: %w", err)
	}
	t.topLevelSearchResults[i] = u
}

t.SmallSuccessf("fetched %d top-level search results for term '%s'", len(t.topLevelSearchResults), t.Params.SearchTerm)
for _, u := range t.topLevelSearchResults {
	fmt.Println(" -", u.String())
}

return t.ReturnBasicStepSuccess()

Create step 3: report the success, and loop back

Now that we have the data, lets make use of them.

This step will simply detect changes in rank for any of the sites, then post those changes to the webhook if there were any:

package seoscraper

import (
	"fmt"
	"strings"

	"github.com/futura-platform/protocol/flowprotocol"
	"github.com/go-resty/resty/v2"
)

const (
	upSymbol      = "⬆️"
	downSymbol    = "⬇️"
	neutralSymbol = "➖"
	newSymbol     = "🆕"
)

var webhookClient = resty.New()

func (t *Task) ReportResults() flowprotocol.TaskStepResult {
	var report strings.Builder
	didChange := t.lastTopLevelSearchResults == nil
	for i1, latestResult := range t.topLevelSearchResults {
		var rankChange *int
		for i2, lastResult := range t.lastTopLevelSearchResults {
			if latestResult.String() == lastResult.String() {
				delta := i2 - i1
				rankChange = &delta
				break
			}
		}
		statusText := newSymbol
		if rankChange != nil {
			rc := *rankChange
			if rc != 0 {
				didChange = true
			}

			if rc < 0 {
				statusText = downSymbol
			} else if rc > 0 {
				statusText = upSymbol
			} else {
				statusText = neutralSymbol
			}
		}
		report.WriteString(fmt.Sprintf("%s - <%s>\n", statusText, latestResult.String()))
	}

	if didChange {
		// report the change
		resp, err := webhookClient.R().
			SetHeader("Content-Type", "application/json").
			SetBody(map[string]string{
				"content": report.String(),
			}).
			Post(t.Params.ResultsWebhook)
		if err != nil {
			return t.ReturnSmallErrorf("failed to request webhook: %w", err)
		} else if resp.StatusCode() != 204 {
			return t.ReturnSmallErrorf("bad webhook status code %d", resp.StatusCode())
		}
	}

	t.lastTopLevelSearchResults = t.topLevelSearchResults

	// wait and scrape again
	t.Sleep(t.GetErrorDelay())
	return flowprotocol.TaskStepResult{
		NextStepLabel: "FetchSearchResults",
	}
}

The finished example is available here

PreviousPer task HTTP request logs

Last updated 10 months ago

hashtagStarting the project

hashtagCreate a new automation in Futura

hashtagFollow the onboarding to create our module

hashtagStart the development environment

hashtagPrepare our parameters

hashtagDefine the module's parameters

hashtagCreate the "search term" basic group type

hashtagAccept this basic group as a parameter

hashtagImplement the steps

hashtagCreate step 1: Initializing the session

hashtagCreate step 2: fetch the search page

hashtagCreate step 3: report the success, and loop back