
The Complete Guide to Screaming Frog Custom Extraction with XPath & Regex


In this guide, I’ll show you how to use Screaming Frog’s Custom Extraction feature to scrape schema markup, HTML, inline JavaScript and more using XPath and regex. I’ve included plenty of Custom Extraction examples that you can copy or modify for your own Screaming Frog crawls.

Why Use the Custom Extraction Feature?

Screaming Frog scrapes a lot of critical information by default: page titles, H1 elements, canonical tags, and so on. But what if you want to pull other data points into your site crawls?

With Custom Extraction, you can program Screaming Frog to scrape just about any information you want. Once you get a handle on how to use it, you can conduct more advanced site crawls and analysis.

For example, here are a few ways that I’ve used Custom Extraction to develop insights and recommendations for my clients’ SEO strategies:

  • Extract publish date to analyze the SEO performance of content by age
  • Extract the comment count of blog articles to show the client which topics drive the most engagement
  • Extract the product availability property from an ecommerce site’s schema markup to help understand how Google was indexing out-of-stock products

How to Use Screaming Frog Custom Extraction

You can access the Custom Extraction feature in the Configuration dropdown under Custom > Extraction.

Next, you set up your extraction rules. This guide will give you all the instruction you’ll need to create your own.

[Image: Screaming Frog custom extraction rules]

  • Extractor Name: The name will appear as the column header for your custom extractions
  • Extraction Method: Choose XPath, Regex or CSSPath
  • Rule: This is where you’ll enter the XPath or regex syntax that we cover in this guide
  • Extraction Filter: Choose Extract Inner HTML, Extract HTML Element, Extract Text, or Extract Function Value (note that these options are not available with regex)

As you run crawls using Custom Extraction, you may find yourself toggling between the Extraction Filter options in order to arrive at the format you want for your data. Several of the examples covered in this guide require a particular Extraction Filter to be selected.
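To see how the filters differ, suppose a page contains the following made-up markup and your rule is //div[@class='author']. Roughly speaking, each filter would return:

    <div class="author"><a href="/team/jane">Jane Doe</a></div>

    Extract HTML Element:  <div class="author"><a href="/team/jane">Jane Doe</a></div>
    Extract Inner HTML:    <a href="/team/jane">Jane Doe</a>
    Extract Text:          Jane Doe

Extract Function Value applies to XPath functions such as count(), which appear later in this guide.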

The data you extract is available in the Custom tab. Set the Filter dropdown to Extraction.

[Image: Custom Extraction tab in the Screaming Frog interface]

It’s also available as a column in the Internal tab alongside all of the default fields that Screaming Frog populates.

Custom Extraction with XPath

What is XPath?

XPath stands for XML Path Language. XPath can be used to navigate through elements and attributes in an XML document.


When to Use XPath

Use XPath to extract any HTML element of a webpage. If you want to scrape information contained in a div, span, p, heading tag or really any other HTML element, then go with XPath.

Google Chrome has a feature that makes writing XPath easier. Using the Inspect tool, you can right-click on any element and copy its XPath syntax. You’ll often need to modify what Chrome gives you before pasting it into Screaming Frog, but it at least gets you started.

[Image: copying an element's XPath with Chrome's Inspect tool]
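As a made-up illustration, right-clicking an author byline and choosing Copy > Copy XPath might give you a brittle, position-based path, which you’d typically rewrite to target an attribute instead (the class name here is hypothetical):

    What Chrome copies:  /html/body/div[2]/div[1]/article/span[3]
    A sturdier rewrite:  //span[@class='author-name']

The rewritten rule keeps working even if the page layout shifts, as long as the class name stays the same.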

Basic Syntax for XPath Web Scraping

Here is the basic syntax for XPath web scraping:

  • // – Search anywhere in the document
  • / – Search from the root (or select direct children when used mid-expression)
  • @ – Select a specific attribute of an element
  • * – Wildcard, used to select any element
  • [ ] – Find a specific element by position or condition
  • . – Refers to the current element
  • .. – Refers to the parent element

Here are common XPath functions:

  • starts-with(x,y) – Checks if x starts with y
  • contains(x,y) – Checks if x contains y
  • last() – Finds the last item in a set
  • count(XPath) – Counts occurrences of the XPath extraction
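To see how these pieces fit together, here’s a breakdown of a rule in the same style as the examples below (the class name and URL path are hypothetical):

    //div[@class='post-meta']//a[contains(@href,'/category/')]/@href

    //div[@class='post-meta']            find any <div> with class "post-meta", anywhere in the document
    //a[contains(@href,'/category/')]    within it, find any link whose href contains "/category/"
    /@href                               return that link's href attribute value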

XPath Custom Extraction Examples

In the lists below, you can copy any of the XPath rules and paste them into Screaming Frog to perform the extraction described alongside each one. Tweak the syntax as needed to customize the extraction to your needs.

How to Extract Common HTML Elements

  • //h1 – Extract all H1 tags
  • //h3[1] – Extract the first H3 tag
  • //h3[2] – Extract the second H3 tag
  • //div/p – Extract any <p> contained within a <div>
  • //div[@class='author'] – Extract any <div> with class "author"
  • //p[@class='bio'] – Extract any <p> with class "bio"
  • //*[@class='bio'] – Extract any element with class "bio"
  • //ul/li[last()] – Extract the last <li> in a <ul>
  • //ol[@class='cat']/li[1] – Extract the first <li> in an <ol> with class "cat"
  • count(//h2) – Count the number of H2s (set the extraction filter to "Function Value")
  • //a[contains(.,'click here')] – Extract any link with anchor text containing "click here"
  • //a[starts-with(@title,'Written by')] – Extract any link with a title starting with "Written by"

 

How to Extract Common HTML Attributes

  • //@href – Extract all links
  • //a[starts-with(@href,'mailto')]/@href – Extract links that start with "mailto" (email addresses)
  • //img/@src – Extract all image source URLs
  • //img[contains(@class,'aligncenter')]/@src – Extract source URLs for images with a class name containing "aligncenter"
  • //link[@rel='alternate'] – Extract <link> elements with the rel attribute set to "alternate"
  • //@hreflang – Extract all hreflang values

 

How to Extract Meta Tags (including Open Graph and Twitter Cards)

I recommend setting the extraction filter to “Extract Inner HTML” for these.

 

Extract Meta Tags:

  • //meta[@property='article:published_time']/@content – Extract the article publish date (a meta tag commonly found on WordPress websites)
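For reference, that rule targets a tag like the one below (the date is made up) and returns the value of its content attribute:

    <meta property="article:published_time" content="2021-03-15T09:30:00+00:00">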

Extract Open Graph:

  • //meta[@property='og:type']/@content – Extract the Open Graph object type
  • //meta[@property='og:image']/@content – Extract the Open Graph featured image URL
  • //meta[@property='og:updated_time']/@content – Extract the Open Graph updated time

Extract Twitter Cards:

  • //meta[@name='twitter:card']/@content – Extract the Twitter Card type
  • //meta[@name='twitter:title']/@content – Extract the Twitter Card title
  • //meta[@name='twitter:site']/@content – Extract the Twitter Card site attribution (the site’s Twitter handle)

How to Extract Schema Markup in Microdata Format

These XPath rules can be used when a website’s schema markup is in microdata format, like this:
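Here’s a simplified, hypothetical example of product markup in microdata format; the itemtype and itemprop attributes are what the XPath rules below target:

    <div itemscope itemtype="https://schema.org/Product">
      <meta itemprop="name" content="Acme Anvil">
      <meta itemprop="price" content="19.99">
      <meta itemprop="priceCurrency" content="USD">
      <link itemprop="availability" href="https://schema.org/InStock">
    </div>

On a page like this, //*[@itemtype]/@itemtype would return https://schema.org/Product, and //*[@itemprop='price']/@content would return 19.99.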

If it’s in JSON-LD format, then jump to the section on how to extract schema markup with regex.

 

Extract Schema Types:

  • //*[@itemtype]/@itemtype – Extract all of the types of schema markup on a page


Extract Breadcrumb Schema:

  • //*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop]/a/@href – Extract all breadcrumb links
  • //*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop][1]/a/@href – Extract the first breadcrumb link
  • //*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop] – Extract breadcrumb names (set the extraction filter to "Extract Text")
  • count(//*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop]) – Count the number of breadcrumb list items (set the extraction filter to "Function Value")


 

Extract Product Schema:

  • //*[@itemprop='name']/@content – Extract product name
  • //*[@itemprop='description']/@content – Extract product description
  • //*[@itemprop='price']/@content – Extract product price
  • //*[@itemprop='priceCurrency']/@content – Extract product currency
  • //*[@itemprop='availability']/@href – Extract product availability
  • //*[@itemprop='sku']/@content – Extract product SKU


Extract Review Schema:

  • //*[@itemprop='reviewCount'] – Extract review count
  • //*[@itemprop='ratingValue'] – Extract rating value
  • //*[@itemprop='bestRating'] – Extract best review rating
  • //*[@itemprop='review']/*[@itemprop='name'] – Extract review name
  • //*[@itemprop='review']/*[@itemprop='author'] – Extract review author
  • //*[@itemprop='review']/*[@itemprop='datePublished']/@content – Extract the publish date of reviews
  • //*[@itemprop='review']/*[@itemprop='reviewBody'] – Extract the body content of reviews


 

Extract Local Business & Organization Schema:

  • //*[contains(@itemtype,'Organization')]/*[@itemprop='name'] – Extract the organization's name
  • //*[@itemprop='address']/*[@itemprop='streetAddress'] – Extract the street address
  • //*[@itemprop='address']/*[@itemprop='addressLocality'] – Extract the address locality
  • //*[@itemprop='address']/*[@itemprop='addressRegion'] – Extract the address region
  • //*[@itemprop='telephone'] – Extract the telephone number
  • //*[@itemprop='sameAs']/@href – Extract the "sameAs" links


 

Extract Article Schema:

  • //*[contains(@itemtype,'Article')]/*[@itemprop='headline'] – Extract the article headline
  • //*[@itemprop='author']/*[@itemprop='name']/@content – Extract author name
  • //*[@itemprop='publisher']/*[@itemprop='name']/@content – Extract publisher name
  • //*[@itemprop='datePublished']/@content – Extract publish date
  • //*[@itemprop='dateModified']/@content – Extract modified date


Custom Extraction with Regex

What is regex?

A regular expression (regex) is a sequence of characters that define a search pattern. Usually this pattern is used by string searching algorithms for "find" or "find and replace" operations on strings...


When to Use Regex

While XPath can extract HTML elements, it stops short of being able to extract inline JavaScript. This is where regex comes in handy.

For example, with regex you can extract schema markup that’s in JSON-LD format. You can extract data out of tracking scripts, like scraping a web page’s Google Analytics tracking ID.

Basic Syntax for Regex

Regex can be complex and confusing, especially when you’re first learning it. This guide isn’t intended to teach you regex, so I recommend starting with this resource aimed at teaching marketers the basics of regex (their examples are specific to Google Analytics, which I think makes regex easier for marketers to grasp).

Next, I suggest reading through Google’s regex guide, which includes more helpful examples.

If you simply need a refresher, here’s a cheat sheet of the regex metacharacters:

Wildcards

  • . – Match any one character
  • * – Match the preceding character 0 or more times
  • ? – Match the preceding character 0 or 1 time
  • + – Match the preceding character 1 or more times
  • | – OR (match the expression before or after the pipe)

Anchors

  • ^ – String begins with the succeeding character
  • $ – String ends with the preceding character

Groups

  • ( ) – Match the enclosed characters in exact order; the parentheses also define the capture group whose contents Screaming Frog returns
  • [ ] – Match any one of the enclosed characters
  • - – Match any character within the specified range (used inside [ ])

Escape

  • \ – Treat the next character literally, not as regex
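As a quick worked example, here’s how the metacharacters combine in one of the patterns used below, ["'](UA-.*?)["'], which pulls a Google Analytics tracking ID out of the surrounding quotes:

    ["']    match a single " or ' (any one of the enclosed characters)
    (       start the capture group – Screaming Frog extracts whatever this group matches
    UA-     match the literal characters "UA-"
    .*?     match any characters, as few as possible, so the match stops at the next quote
    )       end the capture group
    ["']    match the closing " or '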

Regex Custom Extraction Examples

In the lists below, you can copy any of the regex rules and paste them into Screaming Frog to perform the extraction described alongside each one. Tweak the syntax as needed to customize the extraction to your needs.

How to Extract Inline JavaScript

With regex, you can extract any code contained within <script> tags. For marketers, this means you can extract information like clients’ tracking IDs used with their analytics or advertising platforms. Here are a few examples, with a sample tracking script shown after the list:

  • ["'](UA-.*?)["'] – Extract the Google Analytics tracking ID
  • ["'](AW-.*?)["'] – Extract the Google Ads conversion ID and/or remarketing tag
  • ["'](GTM-.*?)["'] – Extract the Google Tag Manager and/or Google Optimize ID
  • fbq\(["']init["'], ["'](.*?)["'] – Extract the Facebook Pixel ID
  • \{ti:["'](.*?)["']\} – Extract the Bing Ads UET tag
  • adroll_adv_id = ["'](.*?)["'] – Extract the AdRoll Advertiser ID
  • adroll_pix_id = ["'](.*?)["'] – Extract the AdRoll Pixel ID
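To make the first rule concrete, a typical Google Analytics snippet looks something like this (the tracking ID is made up); the pattern ["'](UA-.*?)["'] matches the quoted ID and returns the capture group, UA-12345678-1:

    <script>
      window.dataLayer = window.dataLayer || [];
      function gtag(){dataLayer.push(arguments);}
      gtag('js', new Date());
      gtag('config', 'UA-12345678-1');
    </script>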

 

How to Extract Schema Markup in JSON-LD Format

These regex rules can be used when a website’s schema markup is in JSON-LD format, like this:
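Here’s a simplified, hypothetical example of product markup in JSON-LD format. It’s pretty-printed here for readability; in the page source it’s often output on a single line, which is what these patterns generally assume:

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Product",
      "name": "Acme Anvil",
      "offers": {
        "@type": "Offer",
        "price": "19.99",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock"
      }
    }
    </script>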

If it’s in microdata format, then jump to the section on how to extract schema markup with XPath.

 

Extract All Schema Markup and Schema Types:

  • ["']application/ld\+json["']>(.*?)</script> – Extract all of the JSON-LD schema markup
  • ["']@type["']: *["'](.*?)["'] – Extract all of the types of JSON-LD schema markup on a page


Extract Breadcrumb Schema:

  • ["']item["']: *\{["']@id["']: *["'](.*?)["'] – Extract breadcrumb links
  • ["']item["']: *\{["']@id["']: *["'].*?["'], *["']name["']: *["'](.*?)["'] – Extract breadcrumb names

Extract Product Schema:

  • ["']@type["']: *["']Product["'].*?["']name["']: *["'](.*?)["'] – Extract product name
  • ["']@type["']: *["']Product["'].*?["']description["']: *["'](.*?)["'] – Extract product description
  • ["']@type["']: *["']Product["'].*?["']price["']: *["'](.*?)["'] – Extract product price
  • ["']@type["']: *["']Product["'].*?["']priceCurrency["']: *["'](.*?)["'] – Extract product currency
  • ["']@type["']: *["']Product["'].*?["']availability["']: *["'](.*?)["'] – Extract product availability
  • ["']@type["']: *["']Product["'].*?["']sku["']: *["'](.*?)["'] – Extract product SKU

Extract Review Schema:

  • ["']reviewCount["']: *["'](.*?)["'] – Extract review count
  • ["']ratingValue["']: *["'](.*?)["'] – Extract rating value
  • ["']bestRating["']: *["'](.*?)["'] – Extract best rating

Extract Local Business & Organization Schema:

  • ["']@type["']: *["']Organization["'].*?["']name["']: *["'](.*?)["'] – Extract organization name
  • ["']streetAddress["']: *["'](.*?)["'] – Extract the street address
  • ["']addressLocality["']: *["'](.*?)["'] – Extract the address locality
  • ["']addressRegion["']: *["'](.*?)["'] – Extract the address region
  • ["']telephone["']: *["'](.*?)["'] – Extract the telephone number
  • ["']sameAs["']: *\[(.*?)\] – Extract the "sameAs" links

Extract Article or BlogPosting Schema:

  • ["']headline["']: *["'](.*?)["'] – Extract article headline
  • ["']author["'].*?["']name["']: *["'](.*?)["'] – Extract author name
  • ["']publisher["'].*?["']name["']: *["'](.*?)["'] – Extract publisher name
  • ["']datePublished["']: *["'](.*?)["'] – Extract publish date
  • ["']dateModified["']: *["'](.*?)["'] – Extract modified date


If you have any useful XPath or regex extraction rules of your own, please let me know in the comments and I’ll add them to the guide.

Griffin Roer

Griffin has spent more than a decade in the search engine marketing industry. After years of working as an SEO consultant to some of the country’s largest retail and tech brands, Griffin pursued his entrepreneurial calling and founded Uproer in May of 2017. He's also served as a board member for the Minnesota Search Engine Marketing Association.
