Backlink analysis can be done on a spreadsheet but Python has certain advantages. Learn how to use it and customize reports with this script.
Chances are, you’ve used one of the more popular tools such as Ahrefs or Semrush to analyze your site’s backlinks.
These tools trawl the web to get a list of sites linking to your website with a domain rating and other data describing the quality of your backlinks.
It’s no secret that backlinks play a big part in Google’s algorithm, so it makes sense as a minimum to understand your own site before comparing it with the competition.
While using tools gives you insight into specific metrics, learning to analyze backlinks on your own gives you more flexibility over what you’re measuring and how it’s presented.
And although you could do most of the analysis on a spreadsheet, Python has certain advantages.
Other than the sheer number of rows it can handle, it can also more readily look at the statistical side, such as distributions.
In this column, you’ll find step-by-step instructions on how to visualize basic backlink analysis and customize your reports by considering different link attributes using Python.
We’re going to pick a small website from the U.K. furniture sector as an example and walk through some basic analysis using Python.
So what is the value of a site’s backlinks for SEO?
At its simplest, I’d say quality and quantity.
Quality is subjective to the expert yet definitive to Google by way of metrics such as authority and content relevance.
We’ll start by evaluating the link quality with the available data before evaluating the quantity.
Time to code.
We start by importing the data and cleaning up the column names to make it easier to handle and quicker to type for the later stages.
List comprehensions are a powerful and less intensive way to clean up the column names.
The list comprehension instructs Python to convert the column name to lower case for each column (‘col’) in the dataframe’s columns.
Though not strictly necessary, I like having a count column as standard for aggregations and a single value column “project” should I need to group the entire table.
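A minimal sketch of this import-and-rename step follows. The inline CSV text, the column headers, the `rd_count` name, and the `example_site` label are all hypothetical stand-ins for a real backlink export, and replacing spaces with underscores is an extra assumption on top of the lower-casing described above.

```python
import io

import pandas as pd

# Hypothetical stand-in for a backlink export; with a real file you would
# call pd.read_csv("path/to/export.csv") instead.
csv_text = """Domain,DR,Dofollow Ref Domains,First Seen
example-a.co.uk,45,120,2017-01-15
example-b.com,0,-,2021-03-02
"""
raw_df = pd.read_csv(io.StringIO(csv_text))

# List comprehension: lower-case every column name and (an assumption here)
# swap spaces for underscores so the names are easier to type.
raw_df.columns = [col.lower().replace(" ", "_") for col in raw_df.columns]

# Helper columns: a constant count for aggregations and a single-value
# "project" label for grouping the entire table.
raw_df["rd_count"] = 1
raw_df["project"] = "example_site"  # hypothetical project name

print(raw_df.columns.tolist())
```

With the helper columns in place, `raw_df.groupby("project")["rd_count"].sum()` gives the total number of referring domains in one line.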
Now we have a dataframe with clean column names.
The next step is to clean the actual table values and make them more useful for analysis.
Make a copy of the previous dataframe and give it a new name.
Clean the dofollow_ref_domains column, which tells us how many referring domains the linking site has.
In this case, we’ll convert the dashes to zeroes and then cast the whole column as a whole number.
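The copy-then-clean step might look like the sketch below. The sample values are invented, and it assumes the export marks missing counts with a single dash character, as described above.

```python
import pandas as pd

# Invented sample rows; "-" marks a missing count in this hypothetical export.
raw_df = pd.DataFrame({
    "domain": ["example-a.co.uk", "example-b.com", "example-c.org"],
    "dofollow_ref_domains": ["120", "-", "3"],
})

# Work on a copy so the raw import stays untouched.
clean_df = raw_df.copy()

# Replace the dash placeholder with zero, then cast the column to integers.
clean_df["dofollow_ref_domains"] = (
    clean_df["dofollow_ref_domains"].replace("-", "0").astype(int)
)

print(clean_df["dofollow_ref_domains"].tolist())
```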
First_seen tells us the date the link was first found.
We’ll convert the string to a date format that Python can process and then use this to derive the age of the links later on.
Converting first_seen to a date also means we can perform time aggregations by month and year.
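As a rough illustration, the date conversion and a month-level aggregation could be sketched like this. The timestamp format and the `month_year` column name are assumptions; a real export may use a different format string.

```python
import pandas as pd

# Invented first_seen values in an assumed "YYYY-MM-DD HH:MM:SS" format.
df = pd.DataFrame({
    "first_seen": [
        "2017-01-15 09:30:00",
        "2017-01-28 14:00:00",
        "2021-03-02 08:15:00",
    ]
})

# Parse the strings into real datetime values.
df["first_seen"] = pd.to_datetime(df["first_seen"], format="%Y-%m-%d %H:%M:%S")

# Derive a monthly period so links can be aggregated by month and year.
df["month_year"] = df["first_seen"].dt.to_period("M")

# Count the links first seen in each month.
monthly_counts = df.groupby("month_year").size()
print(monthly_counts)
```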
This is useful as it’s not always the case that links for a site will get acquired daily, although it would be nice for my own site if they were!
The link age is calculated by taking today’s date and subtracting the first_seen date.
Then it’s converted to a plain number (pandas typically stores time differences in nanoseconds) and divided by the number of those units in a day to get the number of days.
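The link age calculation can be sketched as below. Rather than casting to an integer and dividing by a large constant, this version divides by `pd.Timedelta(days=1)`, a swapped-in technique that gives the age in days without hard-coding the underlying time unit. The reference date is fixed only to make the example reproducible, and the column names are assumptions.

```python
import pandas as pd

# Invented first_seen dates.
df = pd.DataFrame({
    "first_seen": pd.to_datetime(["2021-01-01", "2021-12-22"]),
})

# Fixed reference date for reproducibility; in practice this would be
# pd.Timestamp.today().
today = pd.Timestamp("2022-01-01")

# Subtracting dates gives a timedelta column.
df["link_age"] = today - df["first_seen"]

# Dividing by a one-day timedelta converts the difference to days as floats.
df["link_age_days"] = df["link_age"] / pd.Timedelta(days=1)

print(df[["first_seen", "link_age_days"]])
```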
With the data types cleaned, and some new data features created, the fun can begin!
The first part of our analysis evaluates link quality. We summarize the whole dataframe using the describe function to get descriptive statistics of all the columns.
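A small sketch of that summary step, using invented values and assumed column names (`dr` for Domain Rating, `link_age` in days):

```python
import pandas as pd

# Invented values; real data would come from the cleaned dataframe.
df = pd.DataFrame({
    "dr": [0, 0, 10, 30, 60],
    "link_age": [50, 200, 210, 900, 1800],
})

# describe() returns count, mean, std, min, quartiles and max for each
# numeric column.
summary = df.describe()
print(summary)
```

Individual figures can be pulled out by label, for example `summary.loc["mean", "dr"]` for the average Domain Rating.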
So from the above table, we can see the average (mean), the number of referring domains (107), and the variation (the 25th percentile and so on).
The average Domain Rating (equivalent to Moz’s Domain Authority) of referring domains is 27.
Is that a good thing?
In the absence of competitor data to compare in this market sector, it’s hard to know. This is where your experience as an SEO practitioner comes in.
However, I’m certain we could all agree that it could be higher.
How much higher to make a shift is another question.
The table above can be a bit dry and hard to visualize, so we’ll plot a histogram to get an intuitive understanding of the referring domains’ authority.
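The binning behind such a histogram can be sketched with NumPy alone. The Domain Rating values and the bin width of 10 are assumptions chosen for illustration, and the text bars stand in for a rendered chart.

```python
import numpy as np
import pandas as pd

# Invented Domain Rating values with a cluster at zero.
dr = pd.Series([0, 0, 0, 0, 5, 12, 27, 33, 48, 51, 66, 72, 89], name="dr")

# Bins of width 10 covering the 0-100 rating scale (an assumed bin width).
bin_edges = np.arange(0, 101, 10)
counts, edges = np.histogram(dr, bins=bin_edges)

# Print a simple text histogram: one row per bin.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{int(left):>3}-{int(right):<3} {'#' * int(count)}")
```

If matplotlib is installed, passing the same edges to `dr.plot.hist(bins=bin_edges)` draws the equivalent chart.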
The distribution is heavily skewed, showing that most of the referring domains have an authority rating of zero.
Beyond zero, the distribution looks fairly uniform, with a roughly equal number of domains across different levels of authority.
Link age is another important factor for SEO.
Let’s check out the distribution below.
The distribution looks more normal even if it is still skewed with the majority of the links being new.
The most common link age appears to be around 200 days, which is less than a year, suggesting most of the links were acquired recently.
Out of interest, let’s see how this correlates with domain authority.
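One way to get a correlation figure is pandas’ `Series.corr`, which computes the Pearson coefficient by default. The sketch below uses invented numbers and assumed column names, so it will not reproduce the figure for the example site.

```python
import pandas as pd

# Invented values purely for illustration.
df = pd.DataFrame({
    "dr": [0, 10, 20, 30, 40],
    "link_age": [100, 300, 200, 500, 400],
})

# Pearson correlation between Domain Rating and link age (in days).
r = df["dr"].corr(df["link_age"])
print(round(r, 2))
```

Values near 0 indicate little linear relationship, while values near 1 or -1 indicate a strong one.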
The plot (along with the 0.19 figure printed above) shows, at most, a very weak correlation between the two.
And why should there be?
A correlation would only imply that the higher authority links were acquired in the early phase of the site’s history.
The reason for the non-correlation will become more apparent later on.
We’ll now look at the link quality throughout time.
If we were to literally plot the number of links by date, the time series would look rather messy and less useful as shown below (no code supplied to render the chart).
Instead, we will calculate a running average of the Domain Rating by month of the year.
Note the expanding( ) function, which instructs Pandas to include all previous rows with each new row.
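A sketch of the running average follows. It assumes the Domain Rating is first averaged per month and that the expanding mean is then taken over those monthly figures; the data and column names are invented.

```python
import pandas as pd

# Invented links with their first_seen dates and Domain Ratings.
df = pd.DataFrame({
    "first_seen": pd.to_datetime(
        ["2017-01-10", "2017-01-20", "2017-02-05", "2021-03-01"]
    ),
    "dr": [60, 40, 20, 80],
})
df["month_year"] = df["first_seen"].dt.to_period("M")

# Average Domain Rating of the links first seen in each month.
monthly_dr = df.groupby("month_year")["dr"].mean().reset_index()

# expanding() widens the window one row at a time, so each row's mean
# covers that month and every month before it.
monthly_dr["dr_runavg"] = monthly_dr["dr"].expanding().mean()

print(monthly_dr)
```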
We now have a table that we can use to feed the graph and visualize it.
This is quite interesting as it seems the site started off attracting high authority links at the beginning of its time (probably a PR campaign launching the business).
It then faded for four years before picking up again with a new round of high authority link acquisition.
It sounds good just writing that heading!
Who wouldn’t want a large volume of (good) links to their site?
Quality is one thing; volume is another, which is what we’ll analyze next.
Much like the previous operation, we’ll use the expanding function to calculate a cumulative sum of the links acquired to date.
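The cumulative count can be sketched the same way. The data is invented, and `rd_count` is the assumed constant count column with a value of 1 per link.

```python
import pandas as pd

# Invented links; each row carries a count of 1.
df = pd.DataFrame({
    "first_seen": pd.to_datetime([
        "2017-01-10", "2017-01-20", "2017-02-05",
        "2021-03-01", "2021-03-09", "2021-03-25",
    ]),
    "rd_count": 1,
})
df["month_year"] = df["first_seen"].dt.to_period("M")

# Links acquired in each month.
monthly_links = df.groupby("month_year")["rd_count"].sum().reset_index()

# expanding().sum() adds up this month and all earlier months, giving the
# total number of links acquired to date.
monthly_links["links_to_date"] = monthly_links["rd_count"].expanding().sum()

print(monthly_links)
```

The same column could also be produced with `cumsum()`; `expanding()` is used here to mirror the running average step.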
That’s the data, now the graph.
We see that link acquisition slowed after the beginning of 2017 but continued steadily over the next four years before accelerating again around March 2021.
Again, it would be good to correlate that with performance.
Of course, the above is just the tip of the iceberg, as it’s a simple exploration of one site. It’s difficult to infer anything useful for improving rankings in competitive search spaces.
Below are some areas for further data exploration and analysis.
I’m certain there are plenty of ideas not listed above; feel free to share below.
Featured Image: metamorworks/Shutterstock
Andreas Voniatis is the Founder of Artios, the SEO systems that help agencies and businesses unlock more ROI from SEO …