Rotating Free Elite Proxies in Python 3 🕷

Thu, Aug 27, 2020 5-minute read

I recently started playing around with web scraping for one of my data mining projects. Like most serious web scrapers, I wanted to avoid getting blocked by the websites I was scraping. (Though, I have to admit I did this only for learning purposes, on a web server specifically set up for this task.)

Some common techniques for minimizing the risk of getting blocked are:

  • Rotating IP addresses
  • Using Proxies
  • Rotating and Spoofing user agents
  • Using headless browsers
  • Reducing the crawling rate

As you probably guessed from the title of this post, I’ll be focusing on the first two bullet points. When those two techniques are combined, the result is often called a rotating proxy.

A rotating proxy is a proxy server that assigns a new IP address from the proxy pool for every connection. That means you can launch a script to send 1,000 requests to any number of sites and get 1,000 different IP addresses.
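The core of the rotation is simply picking a different pool entry for each request and shaping it the way the `requests` module expects. A minimal sketch (the proxy addresses below are placeholders, not real servers):

```python
from random import choice

# Placeholder pool entries; a real pool would be scraped or purchased.
pool = ["203.0.113.1:8080", "203.0.113.2:3128", "203.0.113.3:8080"]

def next_proxy(pool):
    # Pick a random proxy and build the mapping requests expects
    # for its `proxies` keyword argument.
    proxy = choice(pool)
    return {"http": proxy, "https": proxy}

print(next_proxy(pool))
```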

As this isn’t something I came up with myself, lots of paid solutions such as scraperapi.com, scrapehero.com, and parsehub.com offer it as a service. While these have dedicated proxies of various kinds (such as mobile or residential proxies) and usually don’t even require rotating proxies manually, they tend to get pricey quite quickly (especially for a hobby project). If you don’t care too much about reliability, confidentiality, or scalability, I’ve got a free Pythonic alternative.

There ain’t no such thing as a free lunch 🍱

Free proxies are not just often painfully slow; they are also notorious for man-in-the-middle attacks. It’s estimated that roughly a quarter of them modify the content passed through them. Confidentiality isn’t a priority for free proxy servers either, as more than 60 % completely block SSL-encrypted traffic. An interesting read about the risks of free proxies can be found here. 1

Therefore, you should always use HTTPS-enabled proxies and never transmit any sensitive data (such as passwords, session cookies, or tokens) through them, as it could end up in the wrong hands.

If you are using proxy servers for the sake of anonymity, make sure to use an elite proxy, as they do not reveal the source IP in the request headers.
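The difference shows up in the headers the proxy forwards to the target server: transparent proxies pass your real IP along in `X-Forwarded-For`, anonymous proxies still add headers such as `Via` that reveal a proxy is in use, while elite proxies add neither. A rough classifier over the headers a server received (e.g. the `"headers"` field of a `https://httpbin.org/headers` response); the header names checked here are the commonly used ones, not an exhaustive list:

```python
def anonymity_level(forwarded_headers):
    """Classify a proxy by the headers it added to a forwarded request."""
    names = {h.lower() for h in forwarded_headers}
    if "x-forwarded-for" in names:
        return "transparent"  # leaks the client's real IP
    if names & {"via", "forwarded", "x-proxy-id"}:
        return "anonymous"    # hides the IP, but reveals proxy usage
    return "elite"            # no proxy-related headers at all

print(anonymity_level({"Via": "1.1 squid"}))         # anonymous
print(anonymity_level({"User-Agent": "curl/7.64"}))  # elite
```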

How it works in theory

Great, you made it past the disclaimer. Let’s have a look at how a script for rotating proxies could work in theory.

It works like this:

  1. Get a list of free elite proxies from https://sslproxies.org/ that allow HTTPS, and store them as list entries.
  2. Make a new GET request using Python requests module:
    1. Select a random proxy server from the list
    2. Return the response (website) if successful
    3. In case of an error, remove the proxy server from the list. This step is crucial if you’re using free proxy servers, as they are oftentimes overloaded or no longer available. Furthermore, we should catch any SSL connection errors.
    4. Rinse and repeat
  3. Once the list of available servers is used up, we’ll do a complete refresh of the proxy list.

What the implementation looks like in Python

Once I knew how it should work in theory, it was quite easy to implement in Python. Basically, I recycled a couple of other scripts I had seen online and aggregated them into one neat piece of code. 2 3 One adjustment I made myself was to strictly filter for elite proxies, as they are considered more secure than transparent or anonymous proxies. I also bundled everything in a class, which allows me to reuse the script in lots of different projects.

from random import choice

import requests
from lxml.html import fromstring


class Proxies:

    def __init__(self):
        self.proxy = None
        self.proxies = []

    @staticmethod
    def get_proxies():
        url = 'https://sslproxies.org/'
        response = requests.get(url)
        parser = fromstring(response.text)
        p = []
        for i in parser.xpath('//tbody/tr'):
            if i.xpath('.//td[7][contains(text(),"yes")]'):
                if i.xpath('.//td[5][contains(text(),"elite proxy")]'):
                    proxy = ":".join([i.xpath('.//td[1]/text()')[0], i.xpath('.//td[2]/text()')[0]])
                    p.append(proxy)
        return p

    @staticmethod
    def to_proxy(proxy):
        return {"http": proxy, "https": proxy}

    def get(self):
        if len(self.proxies) < 2:
            self.proxies = self.get_proxies()
        self.proxy = choice(self.proxies)
        return self.to_proxy(self.proxy)

    def remove(self):
        # Remove the invalid proxy from the pool; the next call to get()
        # will pick a fresh one
        self.proxies.remove(self.proxy)

    def scrape(self, url, **kwargs):
        # Retry until the request was successful
        while True:
            try:
                proxy = self.get()
                print("Proxy currently being used: {}".format(proxy))
                response = requests.get(url, proxies=proxy, timeout=7, **kwargs)
                # If the request is successful, no exception is raised
                break
            except requests.exceptions.ProxyError:
                print("Proxy error, choosing a new proxy")
                self.remove()
            except requests.exceptions.ConnectTimeout:
                print("Connect timeout, choosing a new proxy")
                self.remove()
            except requests.exceptions.SSLError:
                print("SSL error, choosing a new proxy")
                self.remove()
        return response

How to use the script

The script can be used like this:

import proxies
proxy = proxies.Proxies()
# Make each request using a randomly selected proxy
for i in range(10):
    r = proxy.scrape('https://httpbin.org/ip')
    print(r.text)

The output will look like this:

Proxy currently being used: {'http': '103.194.171.162:5836', 'https': '103.194.171.162:5836'}
{
  "origin": "103.194.171.162"
}

Proxy currently being used: {'http': '85.10.219.98:1080', 'https': '85.10.219.98:1080'}
Proxy error, choosing a new proxy
Proxy currently being used: {'http': '103.194.171.161:5836', 'https': '103.194.171.161:5836'}
{
  "origin": "103.194.171.161"
}
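Since `scrape()` forwards any extra keyword arguments straight to `requests.get`, you can also combine proxy rotation with the user-agent spoofing mentioned at the beginning, e.g. by passing a `headers` argument. The agent strings below are just example values:

```python
from random import choice

# A few example user-agent strings; a real project would use a longer,
# regularly updated list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15",
]

def random_headers():
    # Spoof a random user agent for each request.
    return {"User-Agent": choice(USER_AGENTS)}

# Usage together with the class above:
# r = proxy.scrape('https://httpbin.org/ip', headers=random_headers())
```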

I hope my script is also useful for your own Python web scraping projects.


  1. The Risks of a Free Proxy Server proxyrack.com ↩︎

  2. How To Rotate Proxies and change IP Addresses using Python 3 scrapehero.com ↩︎

  3. Proxy Rotator in Python – Complete Guide zenscrape.com ↩︎