Build Automated SEO Audits with Slack+Python

Get notified when a Python SEO audit job finds issues, and attach the details as a file to the Slack notification so you can act on them right away.

Many of you were curious how such audits could be integrated with Slack for seamless notifications and faster resolution of issues. So, without further ado, let's delve into the topic.


    How to set up your own automated SEO monitoring solution in Slack

    With three example audit scripts:

    • Configure your Slack environment so the script can send notifications and upload files
    • Audit Job #1: Sitemap Status Code Checker
      Report the number of cases with status codes other than 20x
      Attach the URLs and their bad status codes as a file to the message
    • Audit Job #2: “Internal Link Checker”
      Check all internal links found on the website – report the number of cases with a bad status code
      Attach a file listing each bad case with the URL where the link was found, the link URL, its status code, and its anchor text
    • Audit Job #3: “Missing Meta Description Checker”
      Check all URLs for a missing meta description – report the number of cases
      Attach the URLs with missing meta descriptions as a file

    We added two more SEO audit scripts to the running example. The message we want to deliver is that you can run many jobs like this – it is a blueprint for building your own SEO monitoring solution. Be creative: you could even set up monitoring tasks on competitor websites to see exactly what they are doing.

    Setting up your monitoring App in Slack

    First of all, you need a running Slack environment, of course. Slack has a free plan that should be more than enough for most use cases.

    1. If you have a running Slack workspace, go to this link and create a new app.
    2. Click on “Create new app”.
    3. Enter your app name, e.g. SEO Monitoring, and select your Slack workspace.
    4. After creating the new app, you have to add some permissions so your Python script can send notifications and files to Slack. Go to “OAuth & Permissions”.
    5. Under “Bot Token Scopes” please add the following OAuth Scopes:
      files:write
      channels:join
      chat:write
    6. Click “Install to Workspace” and you will see an OAuth Access Token. Copy it – this is what you will paste into your Python script (a quick test snippet follows these steps).
    7. Nearly done with the Slack part – now just choose a channel you want to send messages to. Use the “Add Apps” menu item in that channel and search for your SEO monitoring app.
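
    Before wiring up the full audits, you can verify that the token and channel work with a minimal test message. This is just a sketch – the token and channel name below are placeholders you replace with your own values:

    import requests

    slack_token = "xoxb-your-token-here"   # placeholder - your OAuth Access Token
    slack_channel = "SEO Monitoring"       # placeholder - your target channel

    # chat.postMessage accepts the token as a form field, as the full script below does
    response = requests.post(
        "https://slack.com/api/chat.postMessage",
        data={"token": slack_token, "channel": slack_channel, "text": "Hello from the SEO monitor!"},
    )
    print(response.json().get("ok"))  # True means the message was delivered

    If the call returns ok: false with an error such as not_in_channel, add the app to the channel first (step 7).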

    3 basic SEO audits written in Python

    As already mentioned, this is only a blueprint for setting up your own audit solution – add as many check routines as you want. Just change the sitemap URL, add your Slack OAuth Access Token, and you are ready. Here is the code:

    # Pemavor.com SEO Monitoring with Slack Notifications
    # Author: Stefan Neefischer
    
    import requests
    from urllib.parse import urlparse, urljoin  # urlparse and urljoin live in urllib.parse
    from bs4 import BeautifulSoup
    import advertools as adv
    import time
    import pandas as pd
    import warnings
    warnings.filterwarnings("ignore")
    
    
    def slack_notification_message(slack_token, slack_channel, message):
        # post a plain text message to the given channel via chat.postMessage
        data = {
            'token': slack_token,
            'channel': slack_channel,
            'text': message
        }
        url_chat = 'https://slack.com/api/chat.postMessage'
        response = requests.post(url=url_chat, data=data)
        return response
    
        
    def slack_notification_file(slack_token, slack_channel, filename, filetype):
        # upload a file to the given channel via the files.upload method
        url = "https://slack.com/api/files.upload"
        querystring = {"token": slack_token}
        payload = {"channels": slack_channel}
        # no manual Content-Type header: requests builds the multipart boundary itself
        with open(filename, 'rb') as f:
            file_upload = {"file": (filename, f, filetype)}
            response = requests.post(url, data=payload, params=querystring, files=file_upload)
        return response
        
        
    def getStatuscode(url):
        # fetch a URL and return (status_code, description_flag);
        # description_flag is 1 if a meta description is present, -1 if it is missing
        try:
            headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
            # a full GET is needed because the body is parsed for the meta description
            r = requests.get(url, headers=headers, verify=False, timeout=25, allow_redirects=False)
            soup = BeautifulSoup(r.text, "html.parser")
            metas = soup.find_all('meta')
            description = [meta.attrs['content'] for meta in metas if 'name' in meta.attrs and meta.attrs['name'] == 'description']

            if len(description) > 0:
                des = 1
            else:
                des = -1

            return r.status_code, des

        except Exception:
            # request failed (timeout, DNS error, ...) - signal with -1
            return -1, -1
        
        
    def is_valid(url):
        """
        Checks whether `url` is a valid URL.
        """
        parsed = urlparse(url)
        return bool(parsed.netloc) and bool(parsed.scheme)
    
    
    def get_all_website_links(url):
        """
        Returns all internal links found on `url` (links that belong to the
        same website), together with their anchor text.
        """
        internal_urls = []
        seen_hrefs = set()  # track URLs already collected to avoid duplicates
        # domain name of the URL without the protocol
        domain_name = urlparse(url).netloc
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
        r_content = requests.get(url, headers=headers, verify=False, timeout=25, allow_redirects=False).content
        soup = BeautifulSoup(r_content, "html.parser")
        for a_tag in soup.find_all("a"):
            href = a_tag.attrs.get("href")
            if href == "" or href is None:
                # empty href attribute
                continue
            # join the URL if it's relative (not an absolute link)
            href = urljoin(url, href)
            parsed_href = urlparse(href)
            # remove URL GET parameters, URL fragments, etc.
            href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
            if not is_valid(href):
                # not a valid URL
                continue
            if href in seen_hrefs:
                # already collected
                continue
            if domain_name not in href:
                # external link
                continue
            seen_hrefs.add(href)
            internal_urls.append([href, a_tag.string])
        return internal_urls
    
    
    
    def get_sitemap_urls(site):
        sitemap = adv.sitemap_to_df(site)
        sitemap_urls = sitemap['loc'].dropna().to_list()
        
        return sitemap_urls
    
    
    def sitemap_internallink_status_code_checker(site, SLEEP, slack_token, slack_channel):
        print("Start scraping internal links for all sitemap urls")
        sitemap_urls = get_sitemap_urls(site)
        
        sub_links_dict = dict()
        for url in sitemap_urls:
            sub_links = get_all_website_links(url)
            sub_links_dict[url] = list(sub_links)
    
        print("checking status code and description")
        scrapped_url=dict()
        description_url=dict()
        url_statuscodes = []
        for link in sub_links_dict.keys():
            int_link_list=sub_links_dict[link]
            for int_link in int_link_list:
                internal_link=int_link[0]
                #print(internal_link)
                linktext=int_link[1]
                #print(linktext) 
                if internal_link in scrapped_url.keys():
                    check = [link,internal_link,linktext,scrapped_url[internal_link],description_url[internal_link]]
                else:
                    linkstatus,descriptionstatus=getStatuscode(internal_link)
                    scrapped_url[internal_link]=linkstatus
                    description_url[internal_link]=descriptionstatus
                    check = [link,internal_link,linktext,linkstatus,descriptionstatus]
                    time.sleep(SLEEP)
    
                url_statuscodes.append(check)
            
        url_statuscodes_df=pd.DataFrame(url_statuscodes,columns=["url","internal_link","link_text","status_code","description_status"])
        
        #check status code for all sitemap urls
        sitemap_statuscodes=[]
        for url in sitemap_urls:
            if url in scrapped_url.keys():
                check=[url,scrapped_url[url]]
            else:
                linkstatus,descriptionstatus=getStatuscode(url)
                check=[url,linkstatus]
                time.sleep(SLEEP)
            sitemap_statuscodes.append(check)
    
        sitemap_statuscodes_df=pd.DataFrame(sitemap_statuscodes,columns=["url","status_code"])
                    
        # compute statistics, then send the results to Slack
        strstatus=""
        df_internallink_status=url_statuscodes_df[url_statuscodes_df["status_code"]!=200]
        if len(df_internallink_status)>0:
            df_internallink_status=df_internallink_status[["url","internal_link","link_text","status_code"]]
            df_internallink_status["status_group"]=(df_internallink_status['status_code'] / 100).astype(int) *100
            for status in df_internallink_status["status_group"].unique():
                ststatus=f'{status}'
                noUrls=len(df_internallink_status[df_internallink_status["status_group"]==status])
                sts=ststatus[:-1] + 'X'
                if sts=='X':
                    sts="-1"
                strstatus=f">*{noUrls}* internal links with status code *{sts}*\n" + strstatus
            df_internallink_status=df_internallink_status[["url","internal_link","link_text","status_code"]]
            df_internallink_status.to_csv("internallinks.csv",index=False)
        else:
            strstatus=">*Great news!*, There is no internal links with bad status code\n"
    
        strdescription=""
        df_description=url_statuscodes_df[url_statuscodes_df["description_status"]==-1]
        if len(df_description)>0:
            df_description=df_description[["internal_link","status_code","description_status"]]
            df_description=df_description.drop_duplicates(subset = ["internal_link"])
            df_description.rename(columns={'internal_link': 'url'}, inplace=True)
            df_description.to_csv("linksdescription.csv",index=False)
            lendesc=len(df_description)
            strdescription=f">*{lendesc}* URLs without a *meta description*.\n"
        else:
            strdescription=">*Great news!*, There is no url that don't have *meta description*\n"
            
        sitemapstatus=""    
        df_sitemap_status=sitemap_statuscodes_df[sitemap_statuscodes_df["status_code"]!=200]
        if len(df_sitemap_status)>0:
            df_sitemap_status=df_sitemap_status[["url","status_code"]]
            df_sitemap_status["status_group"]=(df_sitemap_status['status_code'] / 100).astype(int) *100
            for status in df_sitemap_status["status_group"].unique():
                ststatus=f'{status}'
                noUrls=len(df_sitemap_status[df_sitemap_status["status_group"]==status])
                sts=ststatus[:-1] + 'X'
                if sts=='X':
                    sts="-1"
                sitemapstatus=f">*{noUrls}* URLs with status code *{sts}*\n" + sitemapstatus
            df_sitemap_status=df_sitemap_status[["url","status_code"]]
            df_sitemap_status.to_csv("sitemaplinks.csv",index=False)
        else:
            sitemapstatus=">*Great news!*, There is no url in sitemap with bad status code\n"
            
        if (len(df_sitemap_status) + len(df_internallink_status) + len(df_description))>0:
            message=f"After analysing {site} sitemap: \n"+strstatus+strdescription+sitemapstatus+"For more details see the attachement files."
        else:
            message=f"After analysing {site} sitemap: \n"+strstatus+strdescription+sitemapstatus
            
        print("send slack notifications")
        #send notification to slack
        slack_notification_message(slack_token,slack_channel,message)
        if len(df_sitemap_status)>0:
            slack_notification_file(slack_token,slack_channel,"sitemaplinks.csv","text/csv")
        if len(df_internallink_status)>0:    
            slack_notification_file(slack_token,slack_channel,"internallinks.csv","text/csv")
        if len(df_description)>0:    
            slack_notification_file(slack_token,slack_channel,"linksdescription.csv","text/csv")
           
            
    # Enter your XML sitemap
    sitemap = "https://www.pemavor.com/sitemap.xml"
    SLEEP = 0.5  # time in seconds the script waits between requests
    #-------------------------------------------------------------------------
    # Enter your Slack OAuth Access Token here
    slack_token = "XXXX-XXXXXXXX-XXXXXX-XXXXXXX"
    # Change the Slack channel to your target one
    slack_channel = "SEO Monitoring"
    
    sitemap_internallink_status_code_checker(sitemap,SLEEP,slack_token,slack_channel)
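
    Because the script above already defines all the helpers, extending the blueprint is mostly a matter of adding another function. As an illustration (this checker is hypothetical, not part of the original script), a fourth audit job that reports sitemap URLs without a <title> tag could look like this:

    def missing_title_checker(site, SLEEP, slack_token, slack_channel):
        # hypothetical Audit Job #4: report sitemap URLs without a <title> tag
        missing = []
        for url in get_sitemap_urls(site):
            try:
                r = requests.get(url, timeout=25)
                soup = BeautifulSoup(r.text, "html.parser")
                if soup.title is None or not (soup.title.string or "").strip():
                    missing.append(url)
            except Exception:
                missing.append(url)  # count unreachable pages as failures too
            time.sleep(SLEEP)
        if missing:
            pd.DataFrame(missing, columns=["url"]).to_csv("missingtitles.csv", index=False)
            slack_notification_message(slack_token, slack_channel,
                                       f">*{len(missing)}* URLs without a *title tag*\n")
            slack_notification_file(slack_token, slack_channel, "missingtitles.csv", "text/csv")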
    

    Where to run and schedule your Python scripts

    • In real production environments, we recommend hosting your script somewhere in the Cloud. We use Cloud Functions or Cloud Run, triggered via Pub/Sub – a minimal sketch follows below.
    • A simpler approach is a small virtual server, which all big web hosting services provide. They normally run Linux; you can put your Python code there and schedule it with the good old crontab.
    • If you enjoy hacking around, you could also use a Raspberry Pi and run your own Linux-based 24×7 home server. It’s affordable (around $60) and small, so you can place and hide it easily somewhere – see our post “Run Python Cron Jobs on Raspberry Pi Server” for automating your stuff with a Raspberry Pi home server.
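
    For the Cloud route, a minimal sketch of a Pub/Sub-triggered Cloud Function entry point could look like this (the function name and wiring are assumptions – adapt them to your project, and keep the script’s functions in the same main.py):

    # main.py - Google Cloud Function (Python) triggered by a Pub/Sub topic
    def run_seo_audit(event, context):
        # the Pub/Sub message itself is unused; it only serves as the trigger
        sitemap = "https://www.pemavor.com/sitemap.xml"
        slack_token = "XXXX-XXXXXXXX-XXXXXX-XXXXXXX"
        slack_channel = "SEO Monitoring"
        sitemap_internallink_status_code_checker(sitemap, 0.5, slack_token, slack_channel)

    Pair it with Cloud Scheduler publishing to that topic, e.g. once a day. For the crontab route, an entry like 0 6 * * * python3 /path/to/seo_monitoring.py runs the audit every morning at 06:00.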
