
Using the builtwith module in Python 3 (tool: PyCharm; the command line is PyCharm's built-in terminal)
step1: Install the builtwith module with pip install builtwith
(/Users/jockie/install_programs/anaconda) jockie:~/programs/pycharm$ pip install builtwith
Collecting builtwith
Downloading builtwith-1.3.2.tar.gz
Building wheels for collected packages: builtwith
Running setup.py bdist_wheel for builtwith ... done
Stored in directory: /Users/jockie/Library/Caches/pip/wheels/e4/cf/86/aa813feb4c79e680590a42766642b130358a01f1e26ecfe1d6
Successfully built builtwith
Installing collected packages: builtwith
Successfully installed builtwith-1.3.2
step2: Test the builtwith module
import builtwith
info = builtwith.parse('http://www.xuanxiewu.com')
print(info)
Running the code raises the following error:
/Users/jockie/install_programs/anaconda/bin/python.app /Users/jockie/programs/pycharm/python_spider/chp01_01.py
Traceback (most recent call last):
File "/Users/jockie/programs/pycharm/python_spider/chp01_01.py", line 8, in <module>
import builtwith
File "/Users/jockie/install_programs/anaconda/lib/python3.6/site-packages/builtwith/__init__.py", line 42
except Exception , e:
^
SyntaxError: invalid syntax
Process finished with exit code 1
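This is a syntax error. Why does it occur? Because builtwith 1.3.2 was written for Python 2.x, so its source needs a few syntax changes before it runs on Python 3:
1. The Python 2 form 'except Exception, e' is not supported; change it to 'except Exception as e'.
2. Python 2 print statements must become print() function calls.
3. builtwith uses the urllib2 module, which exists only in Python 2; Python 3 uses urllib instead, so every use of urllib2 in __init__.py has to be rewritten for urllib. First replace import urllib2 with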
import urllib.request
import urllib.error
Then replace the urllib2 calls:
request = urllib.request.Request(url, None, {'User-Agent': user_agent})
# request = urllib2.Request(url, None, {'User-Agent': user_agent})
response = urllib.request.urlopen(request)
# response = urllib2.urlopen(request)
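The three changes above can be sketched together in one small standalone example. This is not builtwith's actual source: the helper names build_request and fetch, the default user agent string, and the 10-second timeout are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Python 3 counterpart of urllib2.Request(url, None, headers)."""
    return urllib.request.Request(url, None, {"User-Agent": user_agent})


def fetch(url, user_agent="Mozilla/5.0"):
    """Return the raw response body as bytes, or None if the request fails."""
    try:
        request = build_request(url, user_agent)
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.read()
    except urllib.error.URLError as e:  # Python 2 spelling: except urllib2.URLError, e:
        print("Error:", e)              # Python 2 spelling: print "Error:", e
        return None
```

Note that fetch returns bytes, not str, which matters for the next step.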
Run the code again; it now raises the following error:
/Users/jockie/install_programs/anaconda/bin/python.app /Users/jockie/programs/pycharm/python_spider/chp01_01.py
Traceback (most recent call last):
File "/Users/jockie/programs/pycharm/python_spider/chp01_01.py", line 10, in <module>
info = builtwith.parse('http://www.baidu.com')
File "/Users/jockie/install_programs/anaconda/lib/python3.6/site-packages/builtwith/__init__.py", line 69, in builtwith
if contains(html, snippet):
File "/Users/jockie/install_programs/anaconda/lib/python3.6/site-packages/builtwith/__init__.py", line 111, in contains
return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)
TypeError: cannot use a string pattern on a bytes-like object
Process finished with exit code 1
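This is a type error. In Python 3, response.read() returns bytes rather than str, while builtwith's regular expressions are str patterns, so the data has to be decoded first. Change the following code
if html is None:
    html = response.read()
to
if html is None:
    html = response.read()
    html = html.decode('utf-8')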
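The cause of this TypeError can be reproduced without any network access. The following is a minimal sketch; the pattern, the sample HTML, and the helper name find_in_page are made up for illustration and are not taken from builtwith.

```python
import re

# A str pattern, similar in spirit to the rules builtwith matches against.
PATTERN = re.compile("jquery", flags=re.IGNORECASE)


def find_in_page(raw, encoding="utf-8"):
    """Decode the raw bytes first, then search with the str pattern."""
    text = raw.decode(encoding)
    match = PATTERN.search(text)
    return match.group(0) if match else None


raw = b"<script src='/js/jQuery.min.js'></script>"  # bytes, as response.read() returns

try:
    PATTERN.search(raw)  # str pattern on bytes: raises TypeError
except TypeError as e:
    print("TypeError:", e)

print(find_in_page(raw))  # jQuery
```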
Run the code again; this time the result is correct:
/Users/jockie/install_programs/anaconda/bin/python.app /Users/jockie/programs/pycharm/python_spider/chp01_01.py
{'font-scripts': ['Font Awesome', 'Google Font API'], 'web-frameworks': ['Twitter Bootstrap'], 'javascript-frameworks': ['jQuery']}
Process finished with exit code 0
However, the decoding above hardcodes utf-8. What if a site does not use utf-8? To test this, take www.163.com, which uses gbk. Running the code again raises the following error:
/Users/jockie/install_programs/anaconda/bin/python.app /Users/jockie/programs/pycharm/python_spider/chp01_01.py
Error: 'utf-8' codec can't decode byte 0xcd in position 565: invalid continuation byte
Traceback (most recent call last):
File "/Users/jockie/programs/pycharm/python_spider/chp01_01.py", line 10, in <module>
info = builtwith.parse('http://www.163.com')
File "/Users/jockie/install_programs/anaconda/lib/python3.6/site-packages/builtwith/__init__.py", line 69, in builtwith
if contains(html, snippet):
File "/Users/jockie/install_programs/anaconda/lib/python3.6/site-packages/builtwith/__init__.py", line 111, in contains
return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)
TypeError: cannot use a string pattern on a bytes-like object
Process finished with exit code 1
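The first line of the output shows the utf-8 decode failing; html apparently stays as bytes afterwards, so the same TypeError follows. Changing the encoding to gbk gives the correct result: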
/Users/jockie/install_programs/anaconda/bin/python.app /Users/jockie/programs/pycharm/python_spider/chp01_01.py
{'web-servers': ['Nginx']}
Process finished with exit code 0
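So here is the problem: different sites may use different encodings, and changing the encoding by hand every time the target site changes is extra work and not practical. Is there a way to handle this automatically? This is where the chardet module comes in. Install it the same way, with pip install chardet, add import chardet at the top of builtwith's __init__.py, and modify the source as follows: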
if html is None:
    html = response.read()
    # html = html.decode('utf-8') # add by Johnahton 20170805
    encode_type = chardet.detect(html)
    if encode_type['encoding'] == 'utf-8':
        html = html.decode('utf-8')
    else:
        html = html.decode('gbk')
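With chardet detecting the character encoding, the encoding no longer has to be changed by hand for each site. Note that this version still falls back to gbk for anything chardet does not report as utf-8, so pages in other encodings may need extra handling.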
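The check above can be generalized a little. The sketch below is a variation on it, not the article's exact patch: it tries strict utf-8 first, then uses whatever encoding chardet reports, and only then falls back to gbk. The function name decode_html and the fallback parameter are assumptions for illustration, and the optional import only keeps the sketch runnable when chardet is not installed.

```python
try:
    import chardet  # third-party; install with: pip install chardet
except ImportError:  # keep the sketch runnable without chardet
    chardet = None


def decode_html(raw, fallback="gbk"):
    """Decode response bytes to str without hardcoding a single encoding."""
    # Strict utf-8 decoding rejects almost all non-utf-8 input, so try it first.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Otherwise use chardet's guess; 'encoding' can be None when it has no guess.
    guess = chardet.detect(raw)["encoding"] if chardet else None
    if guess:
        try:
            return raw.decode(guess)
        except (UnicodeDecodeError, LookupError):
            pass
    # Last resort: the gbk fallback, replacing any bytes that cannot be decoded.
    return raw.decode(fallback, errors="replace")
```

Inside builtwith, the call would presumably be html = decode_html(response.read()). chardet's guess can be wrong on short inputs, so the result is best treated as a reasonable default rather than a guarantee.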