Google's Distinguished Engineer Matt Cutts announced that Penguin 2.0 went live on May 22 and webmasters braced for the worst. With the release of the much-anticipated algorithm update, many in the industry wondered how bad the damage would actually be.
I've done a lot of Penguin work since April 24, 2012 when Penguin 1.0 first rolled out, and I was fully prepared to begin analyzing the latest update as well. Similar to the approach I used for Penguin 1.0, I began heavily analyzing Penguin 2.0 victims to learn as much as I could about the new algorithm update. You can read my Penguin 2.0 findings in case you're interested in learning more.
When announcing the release of Penguin 2.0, Cutts explained that Penguin 1.0 only analyzed the homepage of a website versus taking all pages into account (hence the "deeper" references you keep hearing about). And based on my analysis of now 17 sites hit by Penguin, I can tell you that's exactly what I'm seeing.
When analyzing the link profiles of websites hit by Penguin 2.0, you can clearly see unnatural links pointing to deeper pages on the site, and not just the homepage. Almost every site I've analyzed followed that pattern.
Unnatural Links to Deeper Pages = More Links To Remove
If you've been hit by Penguin 2.0 (or 1.0), you must heavily analyze your link profile and identify unnatural links to remove. For some sites, this is a daunting task. There are a number of sites I've analyzed that have hundreds of thousands of links to remove (or more). And the more links you need to remove, the harder it is to keep track of your progress.
But even if you analyze, download, and organize those links, how do you know which ones are truly being removed? Sure, you could check them manually, but you might not be done until 2023.
Wouldn't it be great if there were an automated way to check the spammy inbound links you are trying to remove? Ah, but there is, and that's exactly what I'm going to show you. Actually, there are two SEO tools that can save you an incredible amount of time.
The Frog Helping the Penguin
One of my favorite SEO tools is Screaming Frog. I use it for a number of important tasks related to crawling websites, checking XML sitemaps, flagging crawl errors, checking on-page optimization in bulk, etc. There's not a day that goes by that I don't unleash the frog on a website.
And ever since Penguin 1.0 launched, I've used Screaming Frog for another important task – checking if spammy inbound links are still active. With Screaming Frog, you can set up a custom filter to check for specific HTML code on each page you crawl. At the end of the crawl, you can see which pages still have that code (or don't have that code), and that can save you enormous amounts of time.
In addition, my analysis of sites hit by both Penguin 1.0 and 2.0 revealed many websites hit by malware, flagged as attack sites, etc. When checking spammy inbound links, you definitely run the risk of infection.
Using Screaming Frog can help you avoid visiting all of the spammy pages over and over again. It's just another benefit of using this approach.
Note: Cyrus Shepard wrote a great post last week about the disavow tool and mentioned that you can use Screaming Frog to check page removals (that is, whether the pages containing spammy links to your site now return a 404). You can definitely do that, but there are times the pages remain and only the links are removed.
The approach I'm providing is meant to reveal which links have been removed from pages that still remain on the web. In other words, the webmasters are removing the links and keeping the page active. That said, you can use both approaches to thoroughly check the list of spammy links leading to your website.
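To make the difference between the two approaches concrete, here is a minimal Python sketch that sorts one crawled URL into an outcome. It is not how Screaming Frog works internally. The function name `classify_link_status` and the outcome labels are hypothetical, and it assumes you already have the HTTP status code and page source for each URL.

```python
def classify_link_status(status_code, html, domain):
    """Classify one crawled URL for link-removal tracking.

    status_code: HTTP status returned for the page
    html:        page source as a string (may be empty)
    domain:      your full domain, e.g. "http://www.example.com"
    """
    # The page itself is gone (the page-removal check).
    if status_code in (404, 410):
        return "page removed"
    # Anything else that is not a normal response should be crawled again later.
    if status_code != 200:
        return "recheck"
    # The page is still live: look for the domain in the source (the link check).
    if domain.lower() in html.lower():
        return "link likely present"
    return "link removed"
```

A page that returns 200 and no longer mentions your domain is the case this tutorial focuses on: the webmaster removed the link but kept the page.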
How To Use Screaming Frog to Check Inbound Links
Now that I've explained what you can do, it's time to show you how to do it. I'm sure there are many of you reading this post that could use some mechanism for saving time, while also confirming link removals. Without further ado, let's introduce the frog to the penguin.
1. Analyze, Export, and Organize Your Links
The first step is the hardest, and isn't really part of this tutorial. You'll need to analyze your link profile, identify spammy links, and then export them to Excel.
You can, and should, use a number of tools to analyze your link profile, such as Majestic SEO, Open Site Explorer, Google Webmaster Tools, and Bing Webmaster Tools.
You should download your links, flag unnatural links, and organize them by worksheet.
2. Copy URLs to Text Files
You can use Screaming Frog in "List" mode, which means it will crawl the URLs you provide in a text file. That's what we'll be doing, so it's important to copy your unnatural links from Excel to a text editor. I use Textpad, but you can copy your URLs to any text editor. Each URL should be on a separate line.
Tip: If you're dealing with a lot of links, it's much easier to organize them by category. For example, you might have a worksheet for directories, another for comment spam, another for article sites, etc. That will keep your crawls tighter versus trying to crawl all of the links at once.
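If you would rather script this step than copy and paste, the sketch below shows one way to do it in Python. It assumes you have already pulled (category, URL) pairs out of your spreadsheet. The function names, the category labels, and the file naming are all hypothetical.

```python
import os

def group_urls_by_category(rows):
    """Turn (category, url) pairs into {category: [urls]}, dropping blanks and duplicates."""
    groups = {}
    for category, url in rows:
        url = url.strip()
        if not url:
            continue  # skip empty cells
        urls = groups.setdefault(category, [])
        if url not in urls:
            urls.append(url)
    return groups

def write_url_lists(groups, out_dir):
    """Write one text file per category, one URL per line. Returns {category: path}."""
    os.makedirs(out_dir, exist_ok=True)
    paths = {}
    for category, urls in groups.items():
        path = os.path.join(out_dir, category.replace(" ", "_") + ".txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(urls) + "\n")
        paths[category] = path
    return paths
```

Each resulting file can then be loaded into Screaming Frog as its own crawl, which keeps the crawls small and the results easy to match back to a worksheet.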
3. Launch Screaming Frog
Now that you have your text files, you are ready to unleash the frog. Launch Screaming Frog and select "Mode" from the top menu, and then choose "List". Again, we will be providing a list of URLs for Screaming Frog to crawl versus letting it crawl a specific domain.
4. Select Your File
When you change Screaming Frog to "List" mode, you can click the "Select File" button to choose your text file. Click the button and find the first text file you want to crawl. Screaming Frog will read the file and provide a preview of the URLs it will crawl. Click OK.
5. Configure the Custom Filter
You might be tempted to simply click "Start" at this point, but don't. We still need to configure the custom filter that will flag any URLs that still have a specific piece of HTML code present on the page.
Click "Configuration" and then "Custom" to open up the Custom Filter Configuration dialog box. This is where you can enter HTML code for Screaming Frog to look for on each page it crawls. You can choose whether to flag URLs that contain or don't contain that HTML code.
We'll use "Contains" for the first filter and enter your website's full domain name (including protocol) in the text box for the HTML code (e.g., http://www.example.com). If the page still contains your full domain name in the source code, then there's a good chance the link is still present. Click "OK" when you are done.
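Matching on the domain text is a proxy for the link, which is why it only gives you "a good chance." When a flagged page needs a closer look, a stricter check is to parse the anchors and compare hostnames. The Python sketch below does that with the standard library only. `LinkCollector` and `links_to_host` are hypothetical names, and treating the www and non-www hostnames as the same site is my assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def _strip_www(hostname):
    return hostname[4:] if hostname.startswith("www.") else hostname

def links_to_host(html, host):
    """Return the hrefs in html whose hostname matches host (www is ignored)."""
    collector = LinkCollector()
    collector.feed(html)
    target = _strip_www(host.lower())
    matches = []
    for href in collector.hrefs:
        hostname = urlparse(href).hostname or ""  # relative links have no hostname
        if _strip_www(hostname) == target:
            matches.append(href)
    return matches
```

Unlike a plain text match, this ignores pages that only mention your domain in body text, and it still catches links written with https or without www.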
6. Crawl The Flagged Pages
Click "Start" and Screaming Frog will check each of the URLs in your list, looking for the HTML code we added in the previous step (your domain name). For larger crawls, you can go and crank out other work while the crawl runs, but you can watch in real time if you want.
Your results will appear in the "Custom" tab at the far right of the Screaming Frog user interface. Any URLs that match our custom filter will show up there.
7. Export The Results
Now that you've crawled each of the pages to see if links to your site still remain, you can see which ones were caught by Screaming Frog. Then you can export the results to a CSV file, which can be opened in Excel.
Review the URLs that still link to your site, check to see if those webmasters said they would remove the links, and then follow up with them. You might find that you need to go through this process several times during Penguin work. That's fine, since Screaming Frog is doing the heavy lifting.
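The follow-up list can also be built with a few lines of Python. This sketch assumes the exported CSV has a header row with the page URL in a column named "Address" (adjust `url_column` if your export differs) and that you keep your own list of URLs whose webmasters promised removal. The function name `still_linking` is hypothetical.

```python
import csv
import io

def still_linking(export_csv_text, promised_urls, url_column="Address"):
    """Return the promised-removal URLs that the crawl still flagged, sorted."""
    reader = csv.DictReader(io.StringIO(export_csv_text))
    flagged = {row[url_column].strip() for row in reader if row.get(url_column)}
    return sorted(flagged & set(promised_urls))

# Example with made-up data:
export = (
    "Address,Status Code\n"
    "http://dir.example.org/a,200\n"
    "http://blog.example.net/p1,200\n"
)
promised = ["http://blog.example.net/p1", "http://articles.example.com/x"]
follow_up = still_linking(export, promised)
```

In this made-up example, `follow_up` contains only the blog URL: the webmaster promised to remove the link, but the crawl still found your domain on the page.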
Congratulations, you just saved yourself a boatload of time, and salvaged some of your sanity as well.
Update Your Spreadsheets and Keep Removing
Now that you're receiving immediate feedback from Screaming Frog about which links were actually removed, it's time to update your spreadsheets. Keep separate spreadsheets by date so you can track your progress over time.
Remember, you want to document all of your hard work to make sure you're accurately tracking removals (and in case you need to file a reconsideration request at some point).
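If you keep one list of still-active links per crawl date, measuring progress is a simple set comparison. Here is a small Python sketch of that idea. The function name `removal_progress`, the ISO-style date keys, and the shape of the report are all assumptions for illustration.

```python
def removal_progress(snapshots):
    """Summarize progress across dated crawls.

    snapshots: {"YYYY-MM-DD": set of URLs still linking on that date}
    Returns a list of (date, links_remaining, links_removed_since_previous_crawl).
    """
    report = []
    previous = None
    for date in sorted(snapshots):  # ISO dates sort chronologically as strings
        current = snapshots[date]
        removed = len(previous - current) if previous is not None else 0
        report.append((date, len(current), removed))
        previous = current
    return report
```

A report like this is also handy documentation if you end up filing a reconsideration request.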
It's important to have a well-structured Excel file that documents which links you flagged, which ones you removed, and which ones were disavowed (if you need to use the disavow tool for any remaining links).
And yes, you probably will need to use the disavow tool for some links. Just try to remove as many as you can manually first.
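For the links that remain, the disavow file is a plain text file with one URL per line, `domain:` lines for entire domains, and `#` for comments. The Python sketch below builds those lines from your list of remaining links. The function name `build_disavow_lines` and its parameters are hypothetical, and deciding which sites to disavow at the domain level is still a judgment call you make by hand.

```python
from urllib.parse import urlparse

def build_disavow_lines(remaining_urls, whole_domains=(), note=""):
    """Build the lines of a disavow file.

    remaining_urls: URLs that still link to you after removal requests
    whole_domains:  hostnames to disavow entirely (written as domain:host)
    note:           optional comment written at the top of the file
    """
    lines = []
    if note:
        lines.append("# " + note)
    domains = sorted({d.lower() for d in whole_domains})
    for domain in domains:
        lines.append("domain:" + domain)
    for url in sorted(set(remaining_urls)):
        # Skip individual URLs already covered by a domain: line.
        if (urlparse(url).hostname or "") not in domains:
            lines.append(url)
    return lines
```

Write the lines to a .txt file with `"\n".join(lines)` and review the file by hand before you upload it.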
Deep Crawl for More Advanced Checks
As I mentioned earlier, there are some sites in grave condition link-wise. For example, there are some websites I analyzed with hundreds of thousands of spammy links (or more). For situations like this, your list of unnatural links might strain Screaming Frog (and tie up your computer for a long time). That's when I like to use one of my favorite new tools, Deep Crawl.
Deep Crawl is a cloud-based, heavy-duty solution that SEOs can use to crawl large-scale websites. You can also use regex to test for the presence of content on a webpage as Deep Crawl traverses a website.
What I love about Deep Crawl is that it handles large crawls extremely effectively. In addition, since it's cloud-based, I can customize the crawl settings, schedule it, and unleash it on a website or a list of URLs. Then Deep Crawl emails me when the crawl has been completed.
Deep Crawl provides an "Extraction" filter that you can apply to the crawl.
You can enter a regex that will be applied to each page that's crawled. And similar to Screaming Frog, you can upload a text file of URLs to crawl. But since this is cloud-based, and the heavy lifting will be completed by Deep Crawl's servers, you don't have to shy away from large numbers of URLs to test.
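As an illustration of the kind of regex you might test, here is a Python sketch that builds a pattern matching an anchor tag pointing at your domain. It uses Python's `re` syntax, and I'm assuming you would adapt it to whatever regex flavor your crawler accepts. The helper name `link_pattern` is hypothetical.

```python
import re

def link_pattern(host):
    """Compile a regex matching <a ... href="..."> tags that point at host.

    Matches http, https, or protocol-relative links, with or without www.
    """
    bare = host[4:] if host.startswith("www.") else host
    return re.compile(
        r'<a\s[^>]*href\s*=\s*["\']?(?:https?:)?//(?:www\.)?'
        + re.escape(bare)
        # The host must end here, so example.com.evil.org does not match.
        + r'(?=[/"\'\s>?#:]|$)',
        re.IGNORECASE,
    )

pattern = link_pattern("www.example.com")
found = bool(pattern.search('<a class="x" href="https://example.com/page">anchor</a>'))
```

A regex is a rougher tool than an HTML parser, so treat matches as candidates to review rather than proof that a link exists.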
Summary – Proof That Frogs Can Help Penguins
Dealing with Penguin can be hard work, and that's especially the case when you have tens of thousands of unnatural links to deal with (or more). SEO tools that can automate some of the necessary but monotonous tasks can greatly increase your efficiency. Screaming Frog and Deep Crawl can both help your Penguin-fighting efforts.
Hopefully this post helped you understand how to use Screaming Frog and Deep Crawl to check link removals without having to revisit every page.
If you have been hit hard by Penguin, you should absolutely try both tools out. I think you'll love using them. I do.