Effective Use of Proxies in Web Scraping to Avoid Blocks
Chapter 1: Introduction to Proxies in Web Scraping
When scraping websites, it is often crucial to utilize proxies to prevent your scraper from being blocked. I was surprised by how straightforward this process can be.
The initial step is to compile a list of proxy addresses, typically in host:port form, since Selenium needs a port to connect through. Here's a sample list for illustration purposes (the ports shown are placeholders):
myProxy = ['119.57.186.93:8080', '12.238.193.167:3128', '112.138.37.226:8080']
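Before handing entries to Selenium, it can be worth a quick sanity check that each one is in host:port form. A minimal sketch (the regex and sample entries are illustrative, not part of the original script):

```python
import re

def looks_like_host_port(entry):
    # Matches an IPv4 address followed by a port, e.g. "119.57.186.93:8080".
    return bool(re.fullmatch(r"\d{1,3}(?:\.\d{1,3}){3}:\d{1,5}", entry))

myProxy = ['119.57.186.93:8080', '12.238.193.167:3128', '112.138.37.226:8080']
valid = [p for p in myProxy if looks_like_host_port(p)]
```

Filtering out malformed entries up front saves debugging sessions where the driver silently fails to route traffic.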
I use Selenium for web scraping. My scripts carry a number of imports, but the one that matters for proxies is selenium.webdriver.common.proxy, which provides Proxy and ProxyType:
import pandas as pd
import os, time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from datetime import datetime, date, timedelta
import smtplib
from dateparser.search import search_dates
import re
from bs4 import BeautifulSoup
import requests
import urllib
import random
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.proxy import Proxy, ProxyType
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
For my web scraping tasks, I use Firefox, which requires the following code to set up proxy usage:
def getProxy(myProxy):
    proxy = Proxy({
        'proxyType': ProxyType.MANUAL,
        'httpProxy': myProxy,
        'ftpProxy': myProxy,
        'sslProxy': myProxy,
        'noProxy': ''  # hosts to bypass the proxy, e.g. 'localhost'
    })
    return proxy
Next, I initiate the Firefox driver with a random proxy selected from my list:
driver = webdriver.Firefox(proxy=getProxy(random.choice(myProxy)))
Note that newer Selenium 4 releases removed the proxy keyword argument from webdriver.Firefox; there, assign the Proxy object to options.proxy and pass the Options instance to the driver instead.
I use the random module to select a proxy from my list, and the getProxy function keeps the proxy configuration in one place. Typically, I rotate to a fresh proxy every 50 or 100 iterations during a scraping session. That covers the essentials of incorporating proxies into your web scraping strategy.
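The rotation schedule described above can be sketched without Selenium. A minimal, hypothetical version that deterministically swaps proxies every fixed number of iterations (the rotate_every value and proxy entries are illustrative):

```python
PROXIES = ['119.57.186.93:8080', '12.238.193.167:3128', '112.138.37.226:8080']

def proxy_for_iteration(i, proxies=PROXIES, rotate_every=50):
    # Walk the list in order, advancing to the next proxy every
    # `rotate_every` iterations and wrapping around at the end.
    return proxies[(i // rotate_every) % len(proxies)]

# Iterations 0-49 use the first proxy, 50-99 the second, and so on;
# after the last proxy, the schedule wraps back to the first.
```

In the Selenium loop, you would recreate the driver with the new proxy whenever proxy_for_iteration returns a different address than the previous iteration.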
Chapter 2: Practical Examples of Proxy Usage
To enhance your understanding, here are some video resources:
The first video, Web Scraping with Professional Proxy Servers in Python, explains how to effectively utilize proxy servers while scraping. This resource is invaluable for those looking to improve their web scraping techniques.
The second video, Building an Unblockable Web Scraper with Proxies! | Node.js, dives into creating scrapers that can bypass restrictions, showcasing practical methods and tips.