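A proxy is a server, identified by its IP address, that connects to the internet on your behalf. Instead of sending your request directly to a website, the proxy routes it through that server, which hides your real IP address and location. This helps with privacy, tracking prevention, and getting around blocks. Some proxies also encrypt your traffic for an added layer of security.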

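This article explains how to use proxies with Python Requests, with a focus on web scraping. Web scraping is the task of extracting data from websites, but many sites impose restrictions on it. By changing your IP address and location, proxies make it harder for a website to detect and block you, which helps you work around those restrictions. You can also spread requests across multiple proxies to speed up the process.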

Next, you'll learn how to implement proxies in your project using the Requests Python package.

How to Use a Proxy with Python Requests

To use proxies with Python Requests, you need to set up a new Python project on your computer in which to write and run the web scraping scripts. Create a directory (for example, web_scrape_project) to hold your source code files.

All the code for this tutorial is available in this GitHub repository.

Install Packages

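After you create the directory, install the following Python packages to send requests to the web page and collect its links: Requests, which sends the HTTP requests, and Beautiful Soup, which parses the HTML. They are published on PyPI as requests and beautifulsoup4.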

Components of a Proxy IP Address

Before using a proxy, it helps to understand its components. A proxy server has three main components:

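Protocol: indicates the type of content you can access on the internet. The most common protocols are HTTP and HTTPS.
Address: indicates where the proxy server is located. It can be an IP address (for example, 192.167.0.1) or a DNS hostname (for example, proxyprovider.com).
Port: directs traffic to the correct server process when multiple services run on a single machine (for example, port number 2000).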

Put together, a proxy address looks like 192.167.0.1:2000 or proxyprovider.com:2000. When you pass it to Requests, you also prefix the protocol, for example http://proxyprovider.com:2000.
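
To make the three components concrete, here is a minimal sketch that splits a proxy URL back into protocol, address, and port using the standard library. The URL is the placeholder used throughout this article, not a real, reachable proxy.

```python
from urllib.parse import urlsplit

# Placeholder proxy URL from this article; not a real, reachable proxy.
proxy_url = "http://proxyprovider.com:2000"

parts = urlsplit(proxy_url)

print(parts.scheme)    # protocol
print(parts.hostname)  # address (IP or DNS hostname)
print(parts.port)      # port, parsed as an integer
```

The same call works when the address is an IP, such as http://192.167.0.1:2000.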

How to Set Proxies Directly in Requests

There are several ways to set proxies in Python Requests, and this article covers three different scenarios. In the first example, you'll learn how to set proxies directly in the requests module.

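Start by importing the Requests and Beautiful Soup packages in your Python file. Then create a dictionary named proxies that contains the proxy server information used to hide your IP address while scraping the web page. Here, you have to define a proxy URL for both HTTP and HTTPS connections.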
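
The keys of the proxies dictionary refer to the scheme of the page you are requesting, while the values are the URLs of the proxy itself. The sketch below illustrates that lookup in a deliberately simplified form. pick_proxy is a hypothetical helper written for this explanation, not part of Requests, and Requests' own lookup also accepts more specific keys such as scheme://hostname.

```python
from urllib.parse import urlsplit

# Same placeholder proxy as in this article; the keys are target-URL schemes.
proxies = {
    "http": "http://proxyprovider.com:2000",
    "https": "http://proxyprovider.com:2000",
}


def pick_proxy(target_url, proxies):
    """Hypothetical helper: return the proxy URL registered for the target's scheme."""
    scheme = urlsplit(target_url).scheme
    # None means no proxy is configured for this scheme.
    return proxies.get(scheme)


print(pick_proxy("https://brightdata.com/", proxies))
print(pick_proxy("ftp://example.com/file", proxies))
```

This is why an https key can map to an http:// proxy URL: the key describes the destination, the value describes how to reach the proxy.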

You also need to define a Python variable that holds the URL of the web page you want to scrape. In this tutorial, the URL is https://brightdata.com/.

Next, send a GET request to the web page with the requests.get() method. Here it takes two arguments: the URL of the website and the proxies dictionary. The response from the web page is stored in the response variable.

To collect the links, parse the HTML content of the web page with the Beautiful Soup package by passing response.content and "html.parser" as arguments to the BeautifulSoup() constructor.

Then call find_all() with "a" as the argument to find all the links on the web page. Finally, use the get() method to extract the href attribute of each link.

The following is the complete source code for setting proxies directly in requests:

# Import packages
import requests
from bs4 import BeautifulSoup

# Define the proxies to use. The keys are the scheme of the target URL;
# the values are the URL of the proxy itself.
proxies = {
    'http': 'http://proxyprovider.com:2000',
    'https': 'http://proxyprovider.com:2000',
}

# Define the URL of the web page to scrape
url = "https://brightdata.com/"

# Send a GET request to the website through the proxy
response = requests.get(url, proxies=proxies)

# Parse the HTML content of the website with BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

# Find all the links on the website
links = soup.find_all("a")

# Print all the links
for link in links:
    print(link.get("href"))

When you run this block of code, it sends a request to the defined web page through the proxy IP address and prints every link found on that page.

How to Set Proxies via Environment Variables

Sometimes you need to use the same proxy for every request to different web pages. In that case, it makes sense to set environment variables for your proxy.

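To make the proxy environment variables available to scripts you run from the current shell session, run the following commands in your terminal:

export HTTP_PROXY='http://proxyprovider.com:2000'
export HTTPS_PROXY='http://proxyprovider.com:2000'

Here, the HTTP_PROXY variable sets the proxy server for HTTP requests, and the HTTPS_PROXY variable sets the proxy server for HTTPS requests. The scheme inside each value describes how to reach the proxy itself, so both values use http:// here, matching the proxies dictionary in the previous example.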
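
If you want to confirm which proxy settings a Python process actually sees, a small check like the one below can help. This is only a sketch: it uses the standard library's urllib.request.getproxies() rather than Requests, and it patches os.environ inside the script (an assumption made so the example is self-contained) instead of relying on variables exported in your shell.

```python
import os
import urllib.request
from unittest import mock

# Placeholder values from this article, injected in-process for the demo.
fake_env = {
    "HTTP_PROXY": "http://proxyprovider.com:2000",
    "HTTPS_PROXY": "http://proxyprovider.com:2000",
}

# clear=True hides any proxy variables already set on this machine,
# so the result depends only on fake_env.
with mock.patch.dict(os.environ, fake_env, clear=True):
    detected = urllib.request.getproxies()

# Maps each scheme ("http", "https") to the proxy URL found in the environment.
print(detected)
```

In a real shell session you would skip the patching and simply call urllib.request.getproxies() after running the export commands.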

Now your Python code consists of only a few lines and uses the environment variables whenever it makes a request to a web page:

# Import packages
import requests
from bs4 import BeautifulSoup

# Define the URL of the web page to scrape
url = "https://brightdata.com/"

# Send a GET request to the website; the proxy comes from the environment variables
response = requests.get(url)

# Parse the HTML content of the website with BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

# Find all the links on the website
links = soup.find_all("a")

# Print all the links
for link in links:
    print(link.get("href"))

How to Rotate Proxies Using a Custom Method and an Array of Proxies

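Rotating proxies is crucial because websites often block or restrict bots and scrapers when they receive a large number of requests from the same IP address. When that happens, the website may suspect malicious scraping activity and put measures in place to block or limit access.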

By rotating through different proxy IP addresses, you can avoid detection, appear as multiple organic users, and get around most of the anti-scraping measures implemented on the website.

To rotate proxies, you need to import a few Python libraries: Requests, Beautiful Soup, and the standard library's random module.

Then create a list of proxies to use during rotation. Each entry must be a proxy server URL in the format http://proxyserver.com:port:

# List of proxies
proxies = [
    "http://proxyprovider1.com:2010", "http://proxyprovider1.com:2020",
    "http://proxyprovider1.com:2030", "http://proxyprovider2.com:2040",
    "http://proxyprovider2.com:2050", "http://proxyprovider2.com:2060",
    "http://proxyprovider3.com:2070", "http://proxyprovider3.com:2080",
    "http://proxyprovider3.com:2090"
]

Next, create a custom method called get_proxy(). It uses random.choice() to pick a proxy at random from the list and returns the selected proxy as a dictionary with both the http and https keys. You'll call this method every time you send a new request:

import random

# Custom method to rotate proxies
def get_proxy():
    # Choose a random proxy from the list
    proxy = random.choice(proxies)
    # Return a dictionary with the proxy for both the http and https protocols
    return {'http': proxy, 'https': proxy}
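
Random selection can pick the same proxy several times in a row. If you would rather spread requests evenly, one alternative is round-robin rotation. The sketch below swaps random.choice() for itertools.cycle; get_next_proxy is a hypothetical name, and the shortened proxy list uses the same placeholder hosts as above.

```python
from itertools import cycle

# Shortened placeholder list; these are not real, reachable proxies.
proxies = [
    "http://proxyprovider1.com:2010",
    "http://proxyprovider2.com:2040",
    "http://proxyprovider3.com:2070",
]

# cycle() yields the proxies in order and starts over after the last one.
_proxy_cycle = cycle(proxies)


def get_next_proxy():
    """Hypothetical alternative to get_proxy(): walk the list in order, wrapping around."""
    proxy = next(_proxy_cycle)
    return {"http": proxy, "https": proxy}


# Four calls on a three-item list: the fourth wraps back to the first proxy.
first_four = [get_next_proxy()["http"] for _ in range(4)]
print(first_four)
```

Round-robin gives each proxy the same share of traffic, while random selection is less predictable from the website's point of view; which one fits better depends on your target.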

Once you've created get_proxy(), write a loop that sends a set number of GET requests through the rotated proxies. On each request, requests.get() uses the randomly chosen proxy returned by get_proxy().

Then collect the links from the HTML content of the web page with the Beautiful Soup package, as explained in the first example.

Finally, the code catches any requests.exceptions.RequestException raised during the request and prints the error message to the console.

Here is the complete source code for this example:

# Import packages
import requests
from bs4 import BeautifulSoup
import random

# List of proxies
proxies = [
    "http://proxyprovider1.com:2010", "http://proxyprovider1.com:2020",
    "http://proxyprovider1.com:2030", "http://proxyprovider2.com:2040",
    "http://proxyprovider2.com:2050", "http://proxyprovider2.com:2060",
    "http://proxyprovider3.com:2070", "http://proxyprovider3.com:2080",
    "http://proxyprovider3.com:2090"
]


# Custom method to rotate proxies
def get_proxy():
    # Choose a random proxy from the list
    proxy = random.choice(proxies)
    # Return a dictionary with the proxy for both the http and https protocols
    return {'http': proxy, 'https': proxy}


# Send requests using rotated proxies
for i in range(10):
    # Set the URL to scrape
    url = 'https://brightdata.com/'
    try:
        # Send a GET request with a randomly chosen proxy
        response = requests.get(url, proxies=get_proxy())

        # Parse the HTML content of the website with BeautifulSoup
        soup = BeautifulSoup(response.content, "html.parser")

        # Find all the links on the website
        links = soup.find_all("a")

        # Print all the links
        for link in links:
            print(link.get("href"))
    except requests.exceptions.RequestException as e:
        # Handle any exception raised during the request
        print(e)
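
The loop above prints the error and moves on, so a request that hits a dead proxy is simply lost. One possible extension is to retry the same URL through a different proxy. The sketch below shows the idea under stated assumptions: fetch_with_rotation and fake_fetch are hypothetical names, and the HTTP call is replaced by a stand-in function so the example runs offline.

```python
import random


def fetch_with_rotation(url, proxy_pool, fetch, max_attempts=3):
    """Hypothetical helper: try up to max_attempts different proxies for one URL.

    fetch(url, proxy) must return a response or raise an exception on failure.
    """
    # Sample without replacement so one URL never retries the same proxy.
    candidates = random.sample(proxy_pool, k=min(max_attempts, len(proxy_pool)))
    last_error = None
    for proxy in candidates:
        try:
            return fetch(url, proxy)
        except Exception as error:  # narrow this to the HTTP client's error type in real use
            last_error = error
            print(f"{proxy} failed: {error}")
    raise RuntimeError(f"all {len(candidates)} proxies failed") from last_error


# Stand-in for a real HTTP call so the sketch runs without a network.
def fake_fetch(url, proxy):
    if proxy.endswith(":2010"):
        raise ConnectionError("proxy unreachable")
    return f"fetched {url} via {proxy}"


pool = ["http://proxyprovider1.com:2010", "http://proxyprovider2.com:2040"]
result = fetch_with_rotation("https://brightdata.com/", pool, fake_fetch, max_attempts=2)
print(result)
```

In a real scraper, fetch could wrap requests.get() with the chosen proxy in both the http and https keys and a timeout argument, so an unresponsive proxy fails quickly instead of hanging.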

Using the Bright Data Proxy Service with Python

If you're looking for reliable, fast, and stable proxies for your web scraping tasks, consider Bright Data, a web data platform that offers different types of proxies for a wide range of use cases.

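Bright Data operates a large network of 400M+ monthly residential proxy IPs and more than 770,000 datacenter proxies, which lets it provide reliable and fast proxy solutions. Its proxy services are designed to overcome the challenges of web scraping, ad verification, and other online activities that require anonymous and efficient web data collection.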

Integrating Bright Data's proxies into your Python requests is easy. For example, you can use datacenter proxies to send a request to the URL used in the previous examples.

If you don't have an account yet, sign up for a Bright Data free trial and enter your details to register on the platform.

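Once that's done, follow these steps to create your first proxy:

On the welcome page, click ‘View proxy products’ to see the different proxy types that Bright Data offers.

To create a new proxy, select ‘Datacenter Proxies’, fill in the details on the next page, and save.

Once your proxy is created, the Datacenter Access Parameters screen shows the important parameters (host, port, username, and password) that you need to start accessing and using it.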

Once you have access to your proxy, you can use those parameters to build the proxy URL and send requests with the Requests Python package. The proxy URL has the format username-session-(session-id):password@host:port.

Note: session-id is a random number generated with Python's random package.
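
Because the username, session id, and password are all packed into one URL, a password that contains characters such as @ or : would break it. The sketch below shows one way to assemble the URL safely by percent-encoding the credentials. build_proxy_url is a hypothetical helper, the credentials are placeholders, and the integer range for the session id is an assumption.

```python
import random
from urllib.parse import quote


def build_proxy_url(username, password, host, port, session_id):
    """Hypothetical helper following the username-session-<id>:password@host:port layout."""
    # Percent-encode credentials so characters such as '@' or ':' cannot break the URL.
    user = quote(f"{username}-session-{session_id}", safe="")
    secret = quote(password, safe="")
    return f"http://{user}:{secret}@{host}:{port}"


# Placeholder credentials; the integer session id range is an assumption.
session_id = random.randint(0, 999_999)
proxy_url = build_proxy_url("username", "p@ss:word", "brd.superproxy.io", 33335, session_id)
print(proxy_url)
```

Using a new session id generally asks the provider for a different exit IP, while reusing one keeps requests on the same session; check your provider's documentation for the exact behavior.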

The following code sample sets a Bright Data proxy in a Python request:

import requests
from bs4 import BeautifulSoup
import random

# Define the parameters provided by Bright Data
host = 'brd.superproxy.io'
port = 33335
username = 'username'
password = 'password'
session_id = random.random()

# Build the proxy URL
proxy_url = f'http://{username}-session-{session_id}:{password}@{host}:{port}'

# Define the proxies in dictionary format
proxies = {'http': proxy_url, 'https': proxy_url}

# Send a GET request to the website
url = "https://brightdata.com/"
response = requests.get(url, proxies=proxies)

# Parse the HTML content of the website with BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

# Find all the links on the website
links = soup.find_all("a")

# Print all the links
for link in links:
    print(link.get("href"))

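Here, you import the packages and define the proxy host, port, username, password, and session_id variables. Then you create a proxies dictionary with the http and https keys and the proxy credentials. Finally, you pass the proxies parameter to the requests.get() function to make the HTTP request and collect the links from the URL.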

That's it! You've just made a successful request using Bright Data's proxy service.