Web pages are written in HTML, a markup language closely related to XML. XPath is a query language for selecting parts of an XML document, and because most parsers can treat HTML as the same kind of element tree, it is useful for scraping websites. One of the most popular places to use XPath is Python's lxml library, though other libraries and programs also let you scrape or select data from a web page with XPath.
I was recently using Python to scrape some data from a website. The site had a nicely built table and I wanted all the information from all the ‘td’ tags within the table. The problem was, some of the ‘td’ tags contained ‘a’ tags because the content of that table data cell was a link.
I found the easiest way to select both the text directly within a ‘td’ tag and the text within a child of the ‘td’ tag, such as a link ‘a’ tag, was to use the ‘//’ double forward slash. Where a single ‘/’ matches only direct children, ‘//’ matches elements at any depth below the parent element.
So here is an example:
//table//a
This would select all the table elements within a page, and then select all the ‘a’ elements within those table elements, no matter how deep the ‘a’ elements are nested. To get the text, you would do something like this:
//table//a/text()
That would give you the text within the ‘a’ element.
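As a rough sketch of how those two expressions behave in lxml, the snippet below runs them against a small made-up page. The markup, the PAGE variable, and the link targets are assumptions for illustration only, not the site I was scraping.

```python
from lxml import html

# Hypothetical sample page; the markup is invented for this example.
PAGE = """
<html><body>
  <a href="/home">Home</a>
  <table>
    <tr><td>Alice</td><td><a href="/alice">Profile</a></td></tr>
    <tr><td><b><a href="/bob">Bob</a></b></td><td>n/a</td></tr>
  </table>
</body></html>
"""

tree = html.fromstring(PAGE)

# Every 'a' element nested at any depth inside a table.
# The "Home" link sits outside the table, so it is not matched.
links = tree.xpath("//table//a")

# The text nodes that are direct children of those 'a' elements.
link_text = tree.xpath("//table//a/text()")

print(len(links))   # 2
print(link_text)    # ['Profile', 'Bob']
```

Note that the second link is wrapped in a ‘b’ tag inside its ‘td’, and ‘//’ still finds it.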
For my problem I did something like this:
//table/tbody/tr/td//text()
This is a bit more specific, but it will give me any text within all the selected ‘td’ table data cells. XPath goes as deep as it needs to within the ‘td’ tags to find the text.
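Here is a minimal sketch of that expression in lxml, again on made-up markup (the PAGE variable and its contents are assumptions). The sample includes an explicit ‘tbody’ because the expression names it. Browsers add a ‘tbody’ when they render a table, but as far as I know lxml does not, so if the page source has none, a looser path such as //table//td//text() may be needed instead.

```python
from lxml import html

# Hypothetical sample table; cells mix plain text, a link, and a 'strong' tag.
PAGE = """
<html><body>
  <table>
    <tbody>
      <tr><td>Alice</td><td><a href="/alice">Profile</a></td></tr>
      <tr><td><strong>Bob</strong> Smith</td><td>n/a</td></tr>
    </tbody>
  </table>
</body></html>
"""

tree = html.fromstring(PAGE)

# All text nodes at any depth below each 'td', as one flat list.
all_text = tree.xpath("//table/tbody/tr/td//text()")

# One string per cell: run a relative query (note the leading '.')
# against each 'td' and join the pieces it returns.
cells = [
    "".join(td.xpath(".//text()")).strip()
    for td in tree.xpath("//table/tbody/tr/td")
]

print(all_text)  # ['Alice', 'Profile', 'Bob', ' Smith', 'n/a']
print(cells)     # ['Alice', 'Profile', 'Bob Smith', 'n/a']
```

The flat list splits a cell like "Bob Smith" into two text nodes, so joining per cell is one way to keep each cell's content together.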
And that is how I was able to get the text from a table whether or not a data cell contained a link or any other nested tag (bold, strong, and other types of tags would also work).