Python Web Scraping: An Introduction to requests

Published: 2019-03-06

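A Beginner's Practical Guide to the requests Library

I. Introduction

The requests library gives Python a powerful HTTP client that simplifies access to RESTful APIs and other HTTP services. Its main strengths are ease of use and a Pythonic style, which let developers complete HTTP request tasks quickly.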

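Differences from urllib and urllib2:

• requests is not part of the standard library and must be installed separately

• It offers a higher-level, more Pythonic API than urllib or urllib2 (urllib2 exists only in Python 2; in Python 3 its functionality lives in urllib.request)

• It has built-in support for chunked (streamed) downloads and follows redirects automatically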

Installation:

Install with pip: pip install requests

II. Making Requests

1. requests.request(method, url, **kwargs)

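Builds and sends a request, and returns a Response object.

Parameters:

method: the HTTP method (GET, POST, HEAD, PUT, DELETE, and so on)

url: the request URL

params: query-string parameters, usually given as a dictionary

data: a dictionary, bytes, or file-like object to send as the request body

json: a JSON-serializable Python object to send as the request body

headers: a dictionary of HTTP headers

cookies: a CookieJar object or a dictionary

files: files to upload (sent with multipart encoding)

auth: authentication credentials; a (username, password) tuple selects Basic auth, and an auth object such as HTTPDigestAuth selects Digest

timeout: timeout in seconds, either a single number or a (connect, read) tuple

allow_redirects: whether to follow redirects

proxies: a dictionary mapping protocol to proxy URL

verify: whether to verify the server's SSL certificate; a path to a CA bundle is also accepted

stream: whether to defer downloading the response body; with the default False, the body is downloaded immediately

cert: path to a client-side SSL certificate file, or a (cert, key) tuple

Returns:

A Response object containing the status code, HTTP headers, response body, and related information

Example: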

import requests
response = requests.request('GET', 'http://httpbin.org/get')
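Of the parameters listed above, timeout is easy to overlook: if it is omitted, a request may wait indefinitely. The following is a minimal sketch of how a timeout and basic error handling could be combined. It assumes a throwaway local server (the SlowHandler class and the fetch helper are hypothetical names made up for this illustration) so that it runs without internet access.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

import requests


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that waits one second before replying."""

    def do_GET(self):
        time.sleep(1.0)
        body = b"finally"
        try:
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except OSError:
            pass  # the client gave up and closed the connection

    def log_message(self, format, *args):
        pass  # silence per-request logging


# Port 0 lets the operating system pick a free port.
server = ThreadingHTTPServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

# trust_env=False ignores proxy environment variables,
# so the request goes straight to the local server.
session = requests.Session()
session.trust_env = False


def fetch(timeout):
    """Return the body text, or a short label describing the failure."""
    try:
        response = session.request("GET", url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx statuses into exceptions
        return response.text
    except requests.exceptions.Timeout:
        return "timed out"
    except requests.exceptions.RequestException:
        return "request failed"


slow_result = fetch(timeout=0.2)        # the server is too slow for this limit
ok_result = fetch(timeout=(2.0, 5.0))   # (connect, read) timeouts in seconds
print(slow_result)
print(ok_result)

server.shutdown()
session.close()
```

Catching requests.exceptions.Timeout before the broader RequestException keeps the two failure modes distinguishable.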

2. requests.get()

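Sends a GET request, typically used to retrieve data.

Parameters:

- url: the target URL
- params: a dictionary of query parameters
- **kwargs: any other request parameters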

response = requests.get('http://httpbin.org/get', params={'qs1': 'value1', 'qs2': 'value2'})
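To see what params actually does to the URL, one option is to build the request without sending it. The sketch below uses requests.Request and its prepare() method, so it needs no network access; the extra tags parameter is an assumption added only to show how list values are encoded.

```python
import requests

# Build the request without sending it, so no network access is needed.
request = requests.Request(
    "GET",
    "http://httpbin.org/get",
    params={"qs1": "value1", "qs2": "value2", "tags": ["a", "b"]},
)
prepared = request.prepare()

print(prepared.method)  # GET
# List values become repeated keys in the query string:
print(prepared.url)  # http://httpbin.org/get?qs1=value1&qs2=value2&tags=a&tags=b
```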

3. requests.post()

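Sends a POST request, typically used to submit data.

Parameters:

- url: the target URL
- data: the request body, given as a dictionary (sent form-encoded)
- json: the request body, given as JSON data
- **kwargs: any other request parameters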

response = requests.post('http://httpbin.org/post', data={'name': 'value'})
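The difference between data and json is easiest to see by inspecting the prepared request. This is a small sketch that again uses requests.Request(...).prepare() so nothing is sent over the network; the variable names are chosen for illustration.

```python
import json

import requests

url = "http://httpbin.org/post"
payload = {"name": "value"}

# data= sends the dictionary as an HTML-form style body.
form_request = requests.Request("POST", url, data=payload).prepare()
# json= serializes the dictionary to JSON and sets the matching header.
json_request = requests.Request("POST", url, json=payload).prepare()

print(form_request.headers["Content-Type"])  # application/x-www-form-urlencoded
print(form_request.body)                     # name=value
print(json_request.headers["Content-Type"])  # application/json
print(json.loads(json_request.body))         # {'name': 'value'}
```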

4. requests.head()

Sends a HEAD request, which retrieves only the HTTP headers.

response = requests.head('http://httpbin.org/get')

5. requests.put()

Sends a PUT request, used to update a resource.

response = requests.put('http://httpbin.org/put', data={'name': 'value'})

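III. requests Responses

A requests Response object exposes the following:

status_code: the HTTP status code

headers: the HTTP response headers

json(): a method that parses the body as JSON and returns the result

text: the body decoded to Unicode text

content: the body as raw bytes

cookies: the cookies set by the response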
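The attributes above can be explored side by side with a short script. The sketch below assumes a tiny local server (the JsonHandler class is hypothetical and exists only for this illustration) so that the output does not depend on an external site.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

import requests


class JsonHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that returns a small JSON document and a cookie."""

    def do_GET(self):
        body = json.dumps({"message": "hello"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Set-Cookie", "bid=abc123")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging


server = ThreadingHTTPServer(("127.0.0.1", 0), JsonHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this local test
response = session.get("http://127.0.0.1:%d/" % server.server_address[1], timeout=5)
server.shutdown()
session.close()

print(response.status_code)              # 200
print(response.headers["content-type"])  # header lookup is case-insensitive
print(response.content)                  # raw bytes
print(response.text)                     # bytes decoded to str
print(response.json())                   # parsed into Python objects
print(response.cookies.get("bid"))       # cookie value, or None if absent
```

Note that json is called as a method, while the other names are plain attributes.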

IV. Basic Usage

Some common request examples follow:

1. Fetch the GitHub events list:

import requests
response = requests.get('https://api.github.com/events')
print(response.status_code)
print(response.json())

2. Send a GET request with query parameters:

url = 'http://httpbin.org/get'
params = {'qs1': 'value1', 'qs2': 'value2'}
response = requests.get(url, params=params)
print(response.status_code)
print(response.text)

3. Set custom HTTP headers:

headers = {'User-Agent': 'Chrome'}
response = requests.get('http://httpbin.org/get', headers=headers)
print(response.status_code)
print(response.headers)

4. Read a cookie from Douban:

url = 'http://www.douban.com'
headers = {'User-Agent': 'Chrome'}
response = requests.get(url, headers=headers)
print(response.status_code)
# get() returns None instead of raising KeyError if the 'bid' cookie is not set
print(response.cookies.get('bid'))

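V. Advanced Usage

1. Using a Session object:

A Session persists cookies and default settings across requests and reuses the underlying connection, which makes repeated requests to the same host faster.

session = requests.Session()
response = session.get('http://httpbin.org/get', params={'qs1': 'value1'})
# The /post endpoint only accepts POST, so use session.post() here
response2 = session.post('http://httpbin.org/post', data={'name': 'value'})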
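How session-level defaults are combined with per-request values can be checked without sending anything. The sketch below relies on Session.prepare_request(); the User-Agent string and the token parameter are hypothetical values chosen for the example.

```python
import requests

session = requests.Session()
# Defaults stored on the session apply to every request made through it.
session.headers.update({"User-Agent": "my-crawler/0.1"})  # hypothetical value
session.params = {"token": "demo"}                        # hypothetical value

request = requests.Request(
    "GET", "http://httpbin.org/get", params={"qs1": "value1"}
)
# prepare_request() merges the session defaults into this one request.
prepared = session.prepare_request(request)

print(prepared.url)                    # contains both token=demo and qs1=value1
print(prepared.headers["User-Agent"])  # my-crawler/0.1
session.close()
```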

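2. SSL certificate handling:

The verify parameter controls whether the server certificate is checked, and it can also point to a custom CA bundle.

# Disable certificate verification (insecure; use only for testing)
response = requests.get('https://api.example.com', verify=False)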
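For the custom CA case mentioned above, verify takes a file path instead of a boolean. The following is only a sketch: the fetch_with_ca helper and the bundle path in the usage comment are hypothetical, and the function is defined but not called because it needs a real server and certificate file.

```python
import requests


def fetch_with_ca(url, ca_bundle):
    """Fetch url, checking the server certificate against a custom CA bundle.

    ca_bundle is the path to a PEM file containing the trusted CA
    certificates, for example one issued by an internal company CA.
    """
    return requests.get(url, verify=ca_bundle, timeout=10)


# Hypothetical usage; both the URL and the path are placeholders:
# response = fetch_with_ca('https://api.example.com', '/path/to/internal-ca.pem')
```

Pointing verify at the right CA bundle is generally preferable to verify=False, since the connection stays protected against impersonation.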

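3. File upload:

Both plain files and more complex multipart structures can be uploaded, for example:

# Open the file in binary mode; the with block closes it after the upload
with open('local_file.txt', 'rb') as f:
    response = requests.post('http://httpbin.org/post', files={'file': f})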
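As an illustration of a more complex structure, each entry in files may be a tuple of (filename, content, content type), and ordinary form fields can be sent alongside it through data. The sketch below prepares such a request without sending it; the file name, content, and the note field are made up for the example.

```python
import requests

files = {
    # (filename, content, content type); all three values are hypothetical
    "file": ("report.txt", b"hello from requests\n", "text/plain"),
}
prepared = requests.Request(
    "POST",
    "http://httpbin.org/post",
    files=files,
    data={"note": "demo"},  # an extra form field sent in the same body
).prepare()

# requests generates the multipart boundary and the header automatically.
print(prepared.headers["Content-Type"].split(";")[0])  # multipart/form-data
print(b'filename="report.txt"' in prepared.body)       # True
print(b'name="note"' in prepared.body)                 # True
```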

4. Proxies:

A proxy server can be configured through the proxies parameter.

proxies = {
    'http': 'http://proxy.example.com:3128',
    'https': 'http://proxy.example.com:1080',
}
response = requests.get('http://www.example.com', proxies=proxies)