1.2 The fetched page content comes back garbled.
[Figure: running the fetch, the page source is garbled]
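1.3 Inspecting the page shows that its original encoding is `gb2312`. Setting the encoding of the fetched content to match fixes the garbling.
[Figure: analysing the garbled page source]
[Figure: the page source after the encoding fix]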
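The effect described in 1.2 and 1.3 can be reproduced offline. This is a minimal sketch, assuming the garbling comes from `gb2312` bytes being decoded with a different charset; `requests` falls back to ISO-8859-1 for text responses whose headers name no charset, which is one common way this happens. The sample string is only an illustration.

```python
# Simulate the bytes a gb2312 page would send over the wire.
original = '美女图片'
raw_bytes = original.encode('gb2312')

# Decoding with the wrong charset yields mojibake.
garbled = raw_bytes.decode('iso-8859-1')

# Decoding with the page's real charset recovers the text.
fixed = raw_bytes.decode('gb2312')

print(garbled)
print(fixed)
```

With `requests`, assigning `'gb2312'` to the response's `encoding` attribute before reading `.text` has the same effect as the second decode.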
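2.1 The main page source is now available, so let's see how the page looks in the browser.
[Figure: the target page as rendered in the browser]
Does it make your heart flutter? Personally, I think the fresh-faced girls look pretty good.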
2.2 Crawling approach: simple or hard
[Figure: analysis of the crawling approaches]
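3.1 Start simple: first collect the link to every photo set on this page (each girl has several portrait photos), then send a request to the site for each link. In the browser, press F12 to open developer mode and use the small arrow (the element picker) to select the part of the page that holds the information you want.
[Figure: locating the target content on the page]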
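3.2 Analyse the page source we fetched and use regular expressions to extract the relevant content.
[Figure: source analysis]
[Figure: choosing the regular expression to match]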
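As a rough illustration of the regex approach, the sketch below pulls a link and a description out of each anchor tag with `re.findall`. The markup, the `example.com` URLs, and the pattern are all invented for this example; the real page's HTML is different, so the pattern would need to be adapted to what the developer tools show.

```python
import re

# Hypothetical markup, invented for illustration; the real page differs.
sample_html = '''
<li><a href="http://example.com/2018/1001.html" title="Set A">Set A</a></li>
<li><a href="http://example.com/2018/1002.html" title="Set B">Set B</a></li>
'''

# Two capture groups: the link and the description.
# .*? is non-greedy, so each group stops at the first closing quote.
pattern = r'<a href="(.*?)" title="(.*?)">'
pairs = re.findall(pattern, sample_html)

urls = [url for url, _ in pairs]
titles = [title for _, title in pairs]
print(urls)
print(titles)
```

With more than one capture group, `re.findall` returns a list of tuples, which is why the links and descriptions are split apart afterwards.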
3.3 Use regular expressions to get the URL and description of every photo set on the home page

    '''
    author : 极简XksA
    date : 2018.8.8
    goal : crawl beautiful pictures by category and save them locally
    '''
    import re
    import requests
    # Crawl the main page: http://www.27270.com/ent/meinvtupian/

    # 1. Send an HTTP request to fetch the main page
    r_url = 'http://www.27270.com/ent/meinvtupian/'
    html_code = requests.get(r_url)
    # 2. Set the page encoding to gb2312
    html_code.encoding = 'gb2312'
    html_text = html_code.text
    # 3.1 Get the links
    pattern01 = r'.*? 美女图片'
    beautiful_url = re.findall(pattern01, html_text)
    print(beautiful_url)
    print(len(beautiful_url))
    # 3.2 Get the descriptions
    pattern02 = r'...'  # the rest of this pattern is missing from the source
3.4.1 Page analysis
[Figure: analysing the page of a single photo set]
3.4.2 Code implementation

    for i in range(len(beautiful_url)):
        # 4.1 Request a single page
        picture_codes = requests.get(beautiful_url[i])
        picture_codes.encoding = 'gb2312'
        picture_words = picture_codes.text
        # 4.2 Find the image URL in the page
        # print(picture_words)
        pattern03 = r'...'  # the rest of this pattern is missing from the source
3.4.3 Run result

    # This is not the full set, only the first image of each photo set
    ['http://t2.hddhhn.com/uploads/tu/201803/9999/f45065ed61.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201803/9999/9126579004.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/320ab4622e.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/1fde4d7a1f.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/ef21eaa896.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/e1697062d3.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/419c69bec1.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/4302dc643c.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/df7ff261b0.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/b7b870636f.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/11ec3cf8b2.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/10a0a11a02.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/53e4e2717c.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/7431e6e040.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/228cf34f62.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/a9b7d62201.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/ba91f1e60e.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/76da610fa9.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/3ed260e5ae.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/3d93b5fd09.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/280277b310.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/b69662e2d9.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/fbf7a9178b.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/3f9a20a7da.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/691c12fa18.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/249d3362c4.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/29ea1b5fb7.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201807/9999/db087ab231.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201803/9999/1a9b5f8522.jpg'],
    ['http://t2.hddhhn.com/uploads/tu/201803/9999/9b597acb26.jpg']
3.5 Page through each photo set and collect all of its image URLs
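3.5.1 Page analysis
[Figure: pagination analysis]
To turn pages we only need the pagination links embedded in the page source; each one is the URL to jump to.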
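One way to turn pagination links into full URLs is sketched here. It assumes the links are relative file names such as `261848_2.html` and that the first page of the set lives at the URL shown in the code; both the first-page URL and the pager markup are assumptions made for illustration, not taken from the real page.

```python
import re
from urllib.parse import urljoin

# Assumed URL of the first page of one photo set (illustrative only).
first_page = 'http://www.27270.com/ent/meinvtupian/2018/261848.html'

# Hypothetical pagination markup; the real page's HTML may differ.
pager_html = "<a href='261848_2.html'>2</a><a href='261848_3.html'>3</a>"

# Capture relative file names of the form <id>_<page>.html.
relative_pages = re.findall(r"href='(\d+_\d+\.html)'", pager_html)

# Resolve each relative name against the first page's URL.
page_urls = [urljoin(first_page, name) for name in relative_pages]
print(page_urls)
```

`urljoin` replaces the last path segment of the base URL with the relative name, so the year folder does not have to be hard-coded into a format string.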
3.5.2 Code implementation

    # 4.3 Crawl page by page
    # 4.3.1 Get the pagination links
    pattern04 = r"..."  # the rest of this pattern and the lines after it are missing from the source
    # 4.3.2 Turn pages and get the image URLs
    for j in range(len(pictures_url)):
        other_picture_url = r'http://www.27270.com/ent/meinvtupian/2018/{0}'.format(pictures_url[j])
        pictures_codes = requests.get(other_picture_url)
        pictures_codes.encoding = 'gb2312'
        pictures_words = pictures_codes.text
        picture_02 = re.findall(pattern03, pictures_words)
        picture_address.append(picture_02)
    print(picture_address)
3.5.3 Run result

    ['261848_2.html', '261848_3.html', '261848_4.html', '261848_5.html', '261848_6.html', '261848_7.html', '261848_8.html']
3.6 Download the images
3.6.1 Write a dedicated image download function

    import os
    import requests

    # Image download function
    '''
    folder_name : name of the category folder, taken from the image description
    picture_address : links to one group of images
    '''
    def download_pictures(folder_name, picture_address):
        file_path = r'G:\Beautiful\{0}'.format(folder_name)
        if not os.path.exists(file_path):
            # Create a new folder
            os.mkdir(os.path.join(r'G:\Beautiful', folder_name))
        # Download the images and save them in the new folder
        for i in range(len(picture_address)):
            # Open the target file ('wb': write in binary mode)
            with open(r'G:\Beautiful\{0}\0{1}.jpg'.format(folder_name,i+1), 'wb') as f:
                # Request the download link and write the image bytes
                response = requests.get(picture_address[i][0])
                f.write(response.content)
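For readers who are not on Windows or who want to try the saving logic without hitting the site, here is a hedged variant of the same idea. The name `save_pictures`, the `base_dir` parameter, and the injectable `fetch` argument are all hypothetical additions for this sketch, not part of the original code. It also takes a flat list of URLs rather than a list of one-element lists, and it zero-pads file names to two digits.

```python
from pathlib import Path


def fetch_with_requests(url):
    """Default fetcher: a plain GET that raises on HTTP error codes."""
    import requests  # imported lazily so the sketch loads without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.content


def save_pictures(base_dir, folder_name, picture_urls, fetch=fetch_with_requests):
    """Save each URL's bytes as 01.jpg, 02.jpg, ... under base_dir/folder_name."""
    target = Path(base_dir) / folder_name
    # exist_ok removes the need for a separate existence check;
    # parents=True also creates base_dir if it is missing.
    target.mkdir(parents=True, exist_ok=True)

    saved = []
    for index, url in enumerate(picture_urls, start=1):
        file_path = target / '{0:02d}.jpg'.format(index)
        file_path.write_bytes(fetch(url))
        saved.append(file_path)
    return saved
```

Passing a stub such as `fetch=lambda url: b'fake image bytes'` makes it possible to check the folder layout locally before pointing the function at real image links.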
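3.6.2 Run result
[Figure: run result]
[Figure: crawl result (the descriptions are a bit racy)]
[Figure: one sample photo set; judging by the file sizes, the images are quite sharp]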
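4. The code above only crawls the photo sets listed on the first main page. How do we turn pages on the main page itself? Using the method for paging within a single photo set (3.5), try writing the main-page pagination yourself and crawl more pictures. The goal is to page through the list shown in the figure below.
[Figure: main page pagination]
Finally, here is the source code: https://gitee.com/ShaErHu/spider_beautiful_picture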