In this guide, I’ll show you how to use Screaming Frog’s Custom Extraction feature to scrape schema markup, HTML, inline JavaScript and more using XPath and regex. I’ve included plenty of Custom Extraction examples that you can copy or modify for your own Screaming Frog crawls.

Why Use the Custom Extraction Feature?

Screaming Frog scrapes a lot of critical information by default; page titles, H1 elements, canonical tags, etc. But what if you want to pull other data points into your site crawls?

With Custom Extraction, you can program Screaming Frog to scrape just about any information you want. Once you get a handle on how to use it, you can conduct more advanced site crawls and analysis.

For example, here are a few ways that I’ve used Custom Extraction to develop insights and recommendations for my clients’ SEO strategies:

  • Extract publish date to analyze the SEO performance of content by age
  • Extract the comment count of blog articles to show the client which topics drive the most engagement
  • Extract the product availability property from an ecommerce site’s schema markup to help understand how Google was indexing out of stock products

How to Use Screaming Frog Custom Extraction

You can access the Custom Extraction feature in the Configuration dropdown under Custom > Extraction.

how to use screaming frog custom extraction

Next, you set up your extraction rules. This guide will give you all the instruction you’ll need to create your own.

  • Extractor Name: The name will appear as the column header for your custom extractions
  • Extraction Method: Choose XPath, Regex or CSSPath
  • Rule: This is where you’ll enter the XPath or regex syntax that we cover in this guide
  • Extraction Filter: Choose Extract Inner HTML, Extract HTML Element, Extract Text, or Extract Function Value (note that these options are not available with regex)

As you run crawls using Custom Extraction, you may find yourself toggling between the Extraction Filter options in order to arrive at the format you want for your data. Several of the examples covered in this guide require a particular Extraction Filter to be selected.

The data you extract is available in the Custom tab. Set the Filter dropdown to Extraction.

custom tab in screaming frog

It’s also available as a column in the Internal tab alongside all of the default fields that Screaming Frog populates.

Custom Extraction with XPath

What is XPath?

XPath stands for XML Path Language. XPath can be used to navigate through elements and attributes in an XML document.

Source

When to Use XPath

Use XPath to extract any HTML element of a webpage. If you want to scrape information contained in a div, span, p, heading tag or really any other HTML element, then go with XPath.

Google Chrome has a feature that makes writing XPath easier. Using the Inspect tool, you can right-click on any element and copy the XPath syntax. It’ll often be the case that you’ll need to modify what Chrome gives you before pasting the XPath into Screaming Frog, but it at least gets you started.

inspect element copy xpath

Basic Syntax for XPath Web Scraping

Here is the basic syntax for XPath web scraping:

Syntax Function
// Search anywhere in the document
/ Search within the root
@ Select a specific attribute of an element
* Wildcard, used to select any element
[ ] Find a specific element
. Specifies the current element
.. Specifies the parent element

 
Here are common XPath functions:

Operator Function
starts-with(x,y) Checks if x starts with y
contains(x,y) Checks if x contains y
last() Finds the last item in a set
count(XPath) Counts occurrences of the XPath extraction

 

XPath Custom Extraction Examples

In the tables below, you can copy the syntax in the XPath column and paste it into Screaming Frog to perform the extraction described in the Output column. Tweak the syntax as you’d like in order to customize the extraction to your needs.

How to Extract Common HTML Elements

XPath Output
//h1 Extract all H1 tags
//h3[1] Extract the first H3 tag
//h3[2] Extract the second H3 tag
//div/p Extract any <p> contained within a <div>
//div[@class='author'] Extract any <div> with class “author”
//p[@class='bio'] Extract any <p> with class “bio”
//*[@class='bio'] Extract any element with class “bio”
//ul/li[last()] Extract the last <li> in a <ul>
//ol[@class='cat']/li[1] Extract the first <li> in a <ol> with class “cat”
count(//h2) Count the number of H2’s (set extraction filter to “Function Value”)
//a[contains(.,'click here')] Extract any link with anchor text containing “click here”
//a[starts-with(@title,'Written by')] Extract any link with a title starting with “Written by”

 

How to Extract Common HTML Attributes

XPath Output
//@href Extract all links
//a[starts-with(@href,'mailto')]/@href Extract link that starts with “mailto” (email address)
//img/@src Extract all image source URLs
//img[contains(@class,'aligncenter')]/@src Extract all image source URLs for images with the class name containing “aligncenter”
//link[@rel='alternate'] Extract elements with the rel attribute set to “alternate”
//@hreflang Extract all hreflang values

 

How to Extract Meta Tags (including Open Graph and Twitter Cards)

I recommend setting the extraction filter to “Extract Inner HTML” for these ones.

Extract Meta Tags:

XPath Output
//meta[@property='article:published_time']/@content Extract the article publish date (commonly-found meta tag on WordPress websites)

 
Extract Open Graph:

XPath Output
//meta[@property='og:type']/@content Extract the Open Graph type object
//meta[@property='og:image']/@content Extract the Open Graph featured image URL
//meta[@property='og:updated_time']/@content Extract the Open Graph updated time

 
Extract Twitter Cards:

XPath Output
//meta[@name='twitter:card']/@content Extract the Twitter Card type
//meta[@name='twitter:title']/@content Extract the Twitter Card title
//meta[@name='twitter:site']/@content Extract the Twitter Card site object (Twitter handle)

 

How to Extract Schema Markup in Microdata Format

These XPath rules can be used when a website’s schema markup is in microdata format, like this:

product schema markup in microdata format

If it’s in JSON-LD format, then jump to the section on how to extract schema markup with regex.

Extract Schema Types:

XPath Output
//*[@itemtype]/@itemtype Extract all of the types of schema markup on a page

 

Example Screaming Frog Custom Extraction:


 
Extract Breadcrumb Schema:

XPath Output
//*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop]/a/@href Extract all breadcrumb links
//*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop][1]/a/@href Extract the first breadcrumb link
//*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop] Extract breadcrumb names (set extraction filter to “Extract Text”)
count(//*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop]) Count the number of breadcrumb list items (set extraction filter to “Function Value”)

 

Example Screaming Frog Custom Extraction:


 
Extract Product Schema:

XPath Output
//*[@itemprop='name']/@content Extract product name
//*[@itemprop='description']/@content Extract product description
//*[@itemprop='price']/@content Extract product price
//*[@itemprop='priceCurrency']/@content Extract product currency
//*[@itemprop='availability']/@href Extract product availability
//*[@itemprop='sku']/@content Extract product SKU

 

Example Screaming Frog Custom Extraction:

extract product schema markup with xpath
 
Extract Review Schema:

XPath Output
//*[@itemprop='reviewCount'] Extract review count
//*[@itemprop='ratingValue'] Extract rating value
//*[@itemprop='bestRating'] Extract best review rating
//*[@itemprop='review']/*[@itemprop='name'] Extract review name
//*[@itemprop='review']/*[@itemprop='author'] Extract review author
//*[@itemprop='review']/*[@itemprop='datePublished']/@content Extract the publish date of reviews
//*[@itemprop='review']/*[@itemprop='reviewBody'] Extract the body content of reviews

 

Example Screaming Frog Custom Extraction:


 
Extract Local Business & Organization Schema:

XPath Output
//*[contains(@itemtype,'Organization')]/*[@itemprop='name'] Extract the organization’s name
//*[@itemprop='address']/*[@itemprop='streetAddress'] Extract the street address
//*[@itemprop='address']/*[@itemprop='addressLocality'] Extract the address locality
//*[@itemprop='address']/*[@itemprop='addressRegion'] Extract the address region
//*[@itemprop='telephone'] Extract the telephone number
//*[@itemprop='sameAs']/@href Extract the “sameAs” links

 

Example Screaming Frog Custom Extraction:


 
Extract Article Schema:

XPath Output
//*[contains(@itemtype,'Article')]/*[@itemprop='headline'] Extract the article headline
//*[@itemprop='author']/*[@itemprop='name']/@content Extract author name
//*[@itemprop='publisher']/*[@itemprop='name']/@content Extract publisher name
//*[@itemprop='datePublished']/@content Extract publish date
//*[@itemprop='dateModified']/@content Extract modified date

 

Example Screaming Frog Custom Extraction:


 

Custom Extraction with Regex

What is regex?

A regular expression (regex) is a sequence of characters that define a search pattern. Usually this pattern is used by string searching algorithms for “find” or “find and replace” operations on strings…

Source

When to Use Regex

Where XPath can extract HTML, it stops short of being able to extract inline JavaScript. This is where knowing regex comes in handy.

For example, with regex you can extract schema markup that’s in JSON-LD format. You can extract data out of tracking scripts, like scraping a web page’s Google Analytics tracking ID.

Basic Syntax for Regex

Regex can be complex and confusing, especially when you’re first learning it. This guide isn’t intended to teach you regex, so I recommend starting with this resource aimed at teaching marketers the basics of regex (their examples are specific to Google Analytics, which I think makes regex easier for marketers to grasp).

Next, I suggest reading through Google’s regex guide, which includes more helpful examples.

If you simply need a refresh, here’s a cheat sheet of the regex metacharacters:

Wildcards

Syntax Function
. Match any 1 character
* Match preceding character 0 or more times
? Match preceding character 0 or 1 time
+ Match preceding character 1 or more times
| OR

 

Anchors

Syntax Function
^ String begins with the succeeding character
$ String ends with the preceding character

 

Groups

Syntax Function
( ) Match enclosed characters in exact order
[ ] Match enclosed characters in any order
Match any characters within the specified range

 

Escape

Syntax Function
\ Treat character literally, not as regex

 

Regex Custom Extraction Examples

In the tables below, you can copy the syntax in the Regex column and paste it into Screaming Frog to perform the extraction described in the Output column. Tweak the syntax as you’d like in order to customize the extraction to your needs.

How to Extract Inline JavaScript

With regex, you can extract any code contained within <script> tags. For marketers, this means you can extract information like clients’ tracking ID’s used with their analytics or advertising platforms. Here are a few examples of that:

Regex Output
["'](UA-.*?)["'] Extract the Google Analytics tracking ID
["'](AW-.*?)["'] Extract the Google Ads conversion ID and/or remarketing tag
["'](GTM-.*?)["'] Extract the Google Tag Manager and/or Google Optimize ID
fbq\(["']init["'], ["'](.*?)["'] Extract the Facebook Pixel ID
\{ti:["'](.*?)["']\} Extract the Bing Ads UET tag
adroll_adv_id = ["'](.*?)["'] Extract the AdRoll Advertiser ID
adroll_pix_id = ["'](.*?)["'] Extract the AdRoll Pixel ID

 

How to Extract Schema Markup in JSON-LD Format

These regex rules can be used when a website’s schema markup is in JSON-LD format, like this:

If it’s in microdata format, then jump to the section on how to extract schema markup with XPath.

Extract All Schema Markup and Schema Types:

Regex Output
["']application/ld\+json["']>(.*?)</script> Extract all of the JSON-LD schema markup
["']@type["']: *["'](.*?)["'] Extract all of the types of JSON-LD schema markup on a page

 

Example Screaming Frog Custom Extraction:


 
Extract Breadcrumb Schema:

Regex Output
["']item["']: *\{["']@id["']: *["'](.*?)["'] Extract breadcrumb links
["']item["']: *\{["']@id["']: *["'].*?["'], *["']name["']: *["'](.*?)["'] Extract breadcrumb names

 

Extract Product Schema:

Regex Output
["']@type["']: *["']Product["'].*?["']name["']: *["'](.*?)["'] Extract product name
["']@type["']: *["']Product["'].*?["']description["']: *["'](.*?)["'] Extract product description
["']@type["']: *["']Product["'].*?["']price["']: *["'](.*?)["'] Extract product price
["']@type["']: *["']Product["'].*?["']priceCurrency["']: *["'](.*?)["'] Extract product currency
["']@type["']: *["']Product["'].*?["']availability["']: *["'](.*?)["'] Extract product availability
["']@type["']: *["']Product["'].*?["']sku["']: *["'](.*?)["'] Extract product SKU

 

Extract Review Schema:

Regex Output
["']reviewCount["']: *["'](.*?)["'] Extract review count
["']ratingValue["']: *["'](.*?)["'] Extract rating value
["']bestRating["']: *["'](.*?)["'] Extract best rating

 

Extract Local Business & Organization Schema:

Regex Output
["']@type["']: *["']Organization["'].*?["']name["']: *["'](.*?)["'] Extract organization name
["']streetAddress["']: *["'](.*?)["'] Extract the street address
["']addressLocality["']: *["'](.*?)["'] Extract the address locality
["']addressRegion["']: *["'](.*?)["'] Extract the address region
["']telephone["']: *["'](.*?)["'] Extract the telephone number
["']sameAs["']: *\[(.*?)\] Extract the “sameAs” links

 

Extract Article or BlogPosting Schema:

Regex Output
["']headline["']: *["'](.*?)["'] Extract article headline
["']author["'].*?["']name["']: *["'](.*?)["'] Extract author name
["']publisher["'].*?["']name["']: *["'](.*?)["'] Extract publisher name
["']datePublished["']: *["'](.*?)["'] Extract publish date
["']dateModified["']: *["'](.*?)["'] Extract modified date

 

Additional Resources

Here are several great articles that provide additional perspective and examples:

If you have any useful XPath or regex extraction rules you’ve used, please let me know in the comments and I’ll add them to the guide.