IMPORTXML in Google Sheets
Want to grab data from the internet… quickly?
Maybe you want to copy a table from a website, or maybe you want quickly grab some on-page SEO elements from a competitor.
Now you can do it manually, but that sounds like too much copy and pasting.
IMPORTXML is here to help automate your scraping in Google Sheets.
What is IMPORTXML?
The syntax of the IMPORTXML function is:
The IMPORTXML function imports data from any of various structured data types including XML, HTML, CSV, TSV, and RSS and ATOM XML feeds.
It pulls information from any XML field which means any field enclosed in HTML tags, like <something> and </something>.
You can use it to import publicly available information from the internet. So you are not allowed to use it with websites that you have to log in to access.
When to use IMPORTXML for marketing stuff?
- Scraping data from websites
- When you need on-page data but can’t be bothered to fire up Screaming Frog
How To Use IMPORTXML in Google Sheets
Before we jump into some real uses for IMPORTXML, we need to understand the different arguments:
- The url argument is the URL of the webpage from which you want to import data,
- Include the protocol (https:// or http://).
- The value for URL must either be enclosed in quotation marks or be a reference to a cell containing the appropriate text.
- The xpath_query argument is the XPath query to run on the data of the given URL.
- Each result from the XPath query is placed in its own row of the spreadsheet.
- XPath is a query language used to retrieve pieces of information from websites.
- It has its own syntax that you can learn in details from online resources.
To use XPath, you will need to understand the basics of HTML.
- The data on a website is displayed with HTML and is stored in XML format.
- XPath is the language used to query this data.
- For example, using
“//h1/@title”as the xpath_query returns all the “title” attributes within the <h1> tags accessible on the page.
- You can customize these XPath arguments according to what you’re looking for in the website.
Some Useful XPaths for Marketers:
Checkout these useful XPath extraction shortcuts for getting your grubby hands on data:
- All links on a page:
- Extract all internal links on a page:
- Extract all external links on a page:
- Page title:
- Meta description:
- Hreflang attributes:
You just need to look into the source code of the website from which you want to import data, find the HTML element that can be used and retrieve it by using XPath queries.
To find the HTML element, you have to look at the source of the website. Navigate to the webpage, right-click on the page and click the menu item that says “View source”.
When the source page opens, you’ll see the HTML code for the full page.
For example, part of the source code of the Wikipedia article (on pizza) looks like this:
You may be familiar with HTML tags such as <h2>, <p>, <a> and so on. You need to use these tags in your xpath_query to import the desired data from the websites.
IMPORTXML Formula Examples
Suppose you want to import the paragraph titles of this Wikipedia article.
After looking at the source code, you can see that the titles are enclosed in <h2> tags. So that’s what you have to put in your IMPORTXML formula. In XPath syntax, it should be written as “//h2”.
Try the following function where cell A2 contains the URL of the article:
First, you will see a “Loading…” sign, while the data import is in progress. It takes some time to import from a website (but it’s still faster than doing it manually).
After a few seconds, the function returns what it has found on the website with <h2> tags.
This solution is not perfect because there are some extra  texts next to the titles that the website contains, but it returned the paragraph titles indeed.
This time, some visible text was imported from the website. But with the IMPORTXML function, you can import any data that the HTML source has. For example, it’s very useful to import meta information, link sources or any hidden HTML information that is not visible when browsing the website.
After looking at the source of the website, you can realize that the YouTube videos are enclosed in <iframe> and </iframe> tags. So you have to add //iframe to the XPath argument of the function.
Then, you have to go deeper in this <iframe> tag to find the required data that you want to output. The link source of the videos can be found in the src attributes. Add a “/” slash after the first part of the XPath query followed by the src attribute. The “@” character indicates that it’s an attribute. The whole xpath_query will be: //iframe/@src.
And the whole function to import the YouTube links is:
Sometimes it’s not so obvious what differentiates the HTML code of the content that you want to import, but the XPath language allows you to create more complex queries which make it possible to extract only the desired data.
XPath has some functions that you can include in your queries, such as “contains” or “starts with”.
Let’s see how to import all the references (citations) from the bottom of the Google Sheet article of Wikipedia.
After looking at the source of the website, you can see that the references are enclosed in <li> tags. But that’s not enough since there are many <li> tags on the website.
You need to find something more that differentiates these list element from the other parts of the website. Here it’s the id attribute of the references that all contain the term “cite_note”.
The contain function of XPath can be used to import only the <li> elements whose id contains the ‘cite_note’ term.
The following IMPORTXML function does this:
And it returns all the references that were found on the website with this id.
XPath Query Recap
To sum up, there are a few basic rules to create your own xpath_query argument:
The double slash:
“//” double slash means selecting all elements of the tag that is written after it.
//h1 means select all h1 elements,
//p means all the p element and so on.
[@class=''] means that you only want to select the elements that meet this given criterion. For example,
//h1[@class='title'] only imports the h1 elements whose class attribute is “title”.
You can define these elements on more levels as they appear in the source of the website. For example,
//h1/span will look into the h1 elements and then the span elements inside them.
There are XPath functions you can use to define your xpath_query.
As you can see, IMPORTXML can be a pretty a powerful function in your arsenal.
This simple tutorial is only scratching the surface what you can do (and scrape).