
Web Scraping with Javascript and Node.js
24 June, 2022
Prerequisites
Introduction
Scraping the Basics

Paginator in DevTools

Copy Selector from DevTools

DevTools' "Copy selector" option gives us something like #main > div:nth-child(2) > nav > ul > li:nth-child(2) > a. This approach might be a problem in the future because it will stop working after any minimal change. Besides, it will only capture one of the pagination links, not all of them. A simpler selector, .page-numbers a, will capture all of them, and then we can extract the URLs (hrefs) from those. The selector will match all the link nodes with an ancestor containing the class page-numbers.
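As a sketch of that idea: a real scraper would hand the HTML to a parser such as cheerio and query $('.page-numbers a'). To keep this snippet dependency-free, a regex stands in for the selector on a hardcoded sample, which is not robust HTML parsing and only illustrates the extraction step:

```javascript
// Hardcoded sample of the paginator markup (illustrative, not the real site).
const html = `
  <nav><ul class="page-numbers">
    <li><a href="/page/2/">2</a></li>
    <li><a href="/page/3/">3</a></li>
  </ul></nav>`;

// Grab the href of every anchor. With cheerio this would be roughly:
// $('.page-numbers a').map((_, a) => $(a).attr('href')).toArray()
const hrefs = [...html.matchAll(/<a\s+href="([^"]+)"/g)].map((m) => m[1]);
console.log(hrefs); // [ '/page/2/', '/page/3/' ]
```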
Every product node has the class product, which makes our job easier. And for each of them, the h2 tag and the price node hold the content we want. As for the product ID, we need to match an attribute instead of a class or node type. That can be done using the syntax node[attribute="value"]. Since we are looking only for nodes that have the attribute, there is no need to match it to any particular value.

Following Links
Running that code as-is will throw an error: "await is only valid in async function". That forces us to wrap the initial code inside a function, concretely in an IIFE (Immediately Invoked Function Expression). The syntax is a bit weird: it creates a function and then calls it immediately.

Avoid Blocks
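Before moving on, the async IIFE wrapper just described can be sketched like this; fetchPage is a hypothetical stub standing in for the real Axios request:

```javascript
// await is only valid inside an async function, so the entry point is wrapped
// in an async IIFE: define an anonymous async function and call it immediately.
// fetchPage is a hypothetical stub replacing the real axios/cheerio logic.
const fetchPage = async (url) => `<html from ${url}>`;

const main = (async () => {
  const html = await fetchPage('https://example.com/');
  return html.length;
})();

// The IIFE returns a promise we can observe (or ignore, as the article does).
main.then((len) => console.log('HTML length:', len));
```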
By default, Axios identifies itself with its own User-Agent header (e.g., axios/0.21.1), which anti-bot systems can easily flag. In general, it is a good practice to send actual browser headers along with the UA. That means we need a real-world set of headers because not all browsers and versions use the same ones. We include two in the snippet: Chrome 92 and Firefox 90 on a Linux machine.

Headless Browsers
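Before diving into headless browsers, here is a sketch of the header-set idea just described: keep a small pool of realistic browser header sets and attach one per request. The exact values below are illustrative, not an exhaustive copy of any browser:

```javascript
// Two realistic header sets (Chrome 92 and Firefox 90 on Linux, as in the
// article); the values are illustrative. One is picked at random per request.
const headerSets = [
  {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
  },
  {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
  },
];

const randomHeaders = () =>
  headerSets[Math.floor(Math.random() * headerSets.length)];

// With Axios this would be used as: axios.get(url, { headers: randomHeaders() })
console.log(randomHeaders()['User-Agent']);
```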
So far, every page was requested with axios.get, which can be inadequate in some cases. Say we need JavaScript to load and execute, or to interact with the browser in any way (via mouse or keyboard). While avoiding headless browsers would be preferable, for performance reasons, sometimes there is no other choice. Selenium, Puppeteer, and Playwright are the most used and best-known libraries. The snippet below sets only the User-Agent, but since it is a real browser, the headers will include the entire set (Accept, Accept-Encoding, etcetera).

Now we can separate getting the HTML into a couple of functions, one using Playwright and the other Axios. We would then need a way to select which one is appropriate for the case at hand. For now, it is hardcoded. The output, by the way, is the same, but quite faster when using Axios.
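That split can be sketched as follows; getHtmlPlaywright and getHtmlAxios are stubs here so the structure runs on its own, while the real versions would launch a Playwright browser and call axios.get respectively:

```javascript
// Stubs standing in for the real fetchers: one would drive a headless
// Playwright browser, the other would call axios.get with custom headers.
const getHtmlPlaywright = async (url) => `<html via playwright: ${url}>`;
const getHtmlAxios = async (url) => `<html via axios: ${url}>`;

// Hardcoded for now, as in the article; a smarter version could decide
// per-domain, or fall back to the headless browser when Axios gets blocked.
const useHeadless = false;
const getHtml = useHeadless ? getHtmlPlaywright : getHtmlAxios;

// Async IIFE entry point, same pattern as before.
(async () => {
  const html = await getHtml('https://example.com/');
  console.log(html);
})();
```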
Using Javascript’s Async
It might look like adding await would be enough, right? Well... not so fast. The loop would call crawl and immediately take the following item from the toVisit set. The problem is that the set is empty, since the crawling of the first page hasn't occurred yet, so we added no new links to the list. The function keeps running in the background, but we have already exited from the main one.
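The pitfall can be reproduced in a few lines; crawl is a stub that, like a real network request, only discovers a new link after a delay:

```javascript
// toVisit starts with one URL; crawl simulates fetching it and discovering
// a new link, but only after a delay, as a network request would.
const toVisit = new Set(['https://example.com/page/1/']);

const crawl = async (url) => {
  await new Promise((r) => setTimeout(r, 10)); // simulated network latency
  toVisit.add('https://example.com/page/2/');
};

(async () => {
  for (const url of toVisit) {
    toVisit.delete(url);
    crawl(url); // no await: fire and forget
  }
  // The loop already exited; page 2 will be added too late to be processed.
  console.log(toVisit.size); // 0
})();
```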
We define the function queue in lines 1-20. It will return an object with the function enqueue to add a task to the list. Then it checks whether we are above the concurrency limit. If we are not, it will add one to running and enter a loop that gets a task and runs it with the provided params, until the task list is empty; then it subtracts one from running. This variable is the one that marks when we can or cannot execute more tasks, only allowing it below the concurrency limit. In lines 23-28, there are the helper functions sleep and printer. We instantiate the queue in line 30 and enqueue items in lines 32-34 (which will start running 4 of them).

Final Code
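Based on the description above, such a queue could look like the sketch below (the article's actual implementation may differ; the done array is added here only to make the behavior observable):

```javascript
// A minimal concurrency-limited task queue, following the description above.
const queue = (concurrency = 4) => {
  let running = 0;
  const tasks = [];

  return {
    enqueue: async (task, ...params) => {
      tasks.push({ task, params });
      if (running >= concurrency) return; // above the limit: leave it queued

      running += 1;
      while (tasks.length > 0) {
        const { task, params } = tasks.shift();
        await task(...params); // run tasks until the list is empty
      }
      running -= 1;
    },
  };
};

// Helper functions: sleep pauses, printer logs an item after a short pause.
const done = []; // added for illustration, to record finished tasks
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const printer = async (num) => {
  await sleep(10);
  done.push(num);
  console.log(num);
};

// Instantiate the queue and enqueue items; the first 4 start running at once.
const q = queue(4);
for (let i = 0; i < 8; i++) q.enqueue(printer, i);
```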
Conclusion
Originally published at https://www.zenrows.com