Using builtwith in Python 3


Using the builtwith module in Python 3 (the tool used is PyCharm; the command line is PyCharm's built-in terminal).

Step 1: Install the builtwith module with pip install builtwith

    (/Users/jockie/install_programs/anaconda) jockie:~/programs/pycharm$ pip install builtwith
    Collecting builtwith
      Downloading builtwith-1.3.2.tar.gz
    Building wheels for collected packages: builtwith
      Running setup.py bdist_wheel for builtwith ... done
      Stored in directory: /Users/jockie/Library/Caches/pip/wheels/e4/cf/86/aa813feb4c79e680590a42766642b130358a01f1e26ecfe1d6
    Successfully built builtwith
    Installing collected packages: builtwith
    Successfully installed builtwith-1.3.2

Step 2: Test the builtwith module

    import builtwith
    info = builtwith.parse('http://www.xuanxiewu.com')
    print(info)

Running the code produces the following error:

    /Users/jockie/install_programs/anaconda/bin/python.app /Users/jockie/programs/pycharm/python_spider/chp01_01.py
    Traceback (most recent call last):
      File "/Users/jockie/programs/pycharm/python_spider/chp01_01.py", line 8, in <module>
        import builtwith
      File "/Users/jockie/install_programs/anaconda/lib/python3.6/site-packages/builtwith/__init__.py", line 42
        except Exception , e:
                         ^
    SyntaxError: invalid syntax

    Process finished with exit code 1

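The traceback shows a syntax error. Why? Because builtwith was written for Python 2.x, so its source (site-packages/builtwith/__init__.py) needs a few syntax changes:

1. The Python 2 form 'except Exception , e' is not supported in Python 3; change it to 'except Exception as e'.
2. The Python 2 print statement must be changed to the print() function.
3. builtwith uses the urllib2 module, which belongs to Python 2; Python 3 uses urllib instead, so every use of urllib2 in __init__.py must be rewritten in the urllib style.

For the third change, first replace import urllib2 with: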

    import urllib.request
    import urllib.error

Then replace the urllib2 calls with their urllib equivalents (the original lines are kept as comments):

    request = urllib.request.Request(url, None, {'User-Agent': user_agent})
    # request = urllib2.Request(url, None, {'User-Agent': user_agent})
    response = urllib.request.urlopen(request)
    # response = urllib2.urlopen(request)
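
The three changes listed above can be seen together in one small Python 3 sketch. The function name fetch_bytes and the default user_agent value are hypothetical, chosen only for illustration. This is not builtwith's actual code, just the same urllib.request calls wrapped in the Python 3 exception and print syntax.

```python
import urllib.error
import urllib.request


def fetch_bytes(url, user_agent='Mozilla/5.0'):
    """Return the raw response body as bytes, or None if the request fails.

    fetch_bytes and the default user_agent are illustrative names,
    not part of builtwith.
    """
    try:
        request = urllib.request.Request(url, None, {'User-Agent': user_agent})
        with urllib.request.urlopen(request) as response:
            return response.read()
    except (urllib.error.URLError, ValueError) as e:  # Python 2: "except Exception, e:"
        print('Error:', e)                            # Python 2: "print 'Error:', e"
        return None


if __name__ == '__main__':
    # A data: URL lets the sketch run without network access.
    print(fetch_bytes('data:text/plain,hello'))
```

Note that the return value is a bytes object, which matters for the next error.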

Run the code again; it now reports the following error:

    /Users/jockie/install_programs/anaconda/bin/python.app /Users/jockie/programs/pycharm/python_spider/chp01_01.py
    Traceback (most recent call last):
      File "/Users/jockie/programs/pycharm/python_spider/chp01_01.py", line 10, in <module>
        info = builtwith.parse('http://www.baidu.com')
      File "/Users/jockie/install_programs/anaconda/lib/python3.6/site-packages/builtwith/__init__.py", line 69, in builtwith
        if contains(html, snippet):
      File "/Users/jockie/install_programs/anaconda/lib/python3.6/site-packages/builtwith/__init__.py", line 111, in contains
        return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)
    TypeError: cannot use a string pattern on a bytes-like object

    Process finished with exit code 1

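This time it is a type error. The data returned by urllib has changed format: in Python 3, response.read() returns bytes rather than str, so it must be decoded before the str regex patterns can be matched against it. Change the following code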

    if html is None:  
        html = response.read() 

to

    if html is None:
        html = response.read()
        html = html.decode('utf-8')
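
The TypeError can be reproduced without builtwith or a network connection. In the sketch below the sample bytes and the 'jquery' pattern are made up for illustration; only the combination of a str pattern with a bytes argument mirrors the call in the traceback.

```python
import re

# Made-up sample of what response.read() returns in Python 3: bytes, not str.
html = b'<script src="/js/jQuery.min.js"></script>'
pattern = re.compile('jquery', flags=re.IGNORECASE)  # a str pattern

try:
    pattern.search(html)  # str pattern against bytes
    raised = False
except TypeError as e:
    raised = True
    print('TypeError:', e)

# After decoding the bytes to str, the same pattern matches.
match = pattern.search(html.decode('utf-8'))
print(match.group(0))
```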

Run the code again; this time it returns the correct result:

    /Users/jockie/install_programs/anaconda/bin/python.app /Users/jockie/programs/pycharm/python_spider/chp01_01.py
    {'font-scripts': ['Font Awesome', 'Google Font API'], 'web-frameworks': ['Twitter Bootstrap'], 'javascript-frameworks': ['jQuery']}

    Process finished with exit code 0

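However, the decoding above hard-codes utf-8. What if a site does not use utf-8? As a test, take www.163.com, which uses gbk. Running the code again fails with the following error: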

    /Users/jockie/install_programs/anaconda/bin/python.app /Users/jockie/programs/pycharm/python_spider/chp01_01.py
    Error: 'utf-8' codec can't decode byte 0xcd in position 565: invalid continuation byte
    Traceback (most recent call last):
      File "/Users/jockie/programs/pycharm/python_spider/chp01_01.py", line 10, in <module>
        info = builtwith.parse('http://www.163.com')
      File "/Users/jockie/install_programs/anaconda/lib/python3.6/site-packages/builtwith/__init__.py", line 69, in builtwith
        if contains(html, snippet):
      File "/Users/jockie/install_programs/anaconda/lib/python3.6/site-packages/builtwith/__init__.py", line 111, in contains
        return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)
    TypeError: cannot use a string pattern on a bytes-like object

    Process finished with exit code 1

Change the encoding to gbk and the result is correct:

    /Users/jockie/install_programs/anaconda/bin/python.app /Users/jockie/programs/pycharm/python_spider/chp01_01.py
    {'web-servers': ['Nginx']}

    Process finished with exit code 0
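
The decoding failure and its fix can also be reproduced offline. In this sketch the sample text is chosen only for illustration: it encodes two Chinese characters as gbk, shows that decoding them as utf-8 raises UnicodeDecodeError, and then decodes them correctly as gbk.

```python
# Illustrative sample: bytes as a gbk-encoded page would send them.
raw = '网易'.encode('gbk')

try:
    raw.decode('utf-8')  # wrong codec for these bytes
    utf8_ok = True
except UnicodeDecodeError as e:
    utf8_ok = False
    print('Error:', e)

# Decoding with the codec the bytes were actually written in succeeds.
text = raw.decode('gbk')
print(text)
```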

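So here is the problem: different sites do not necessarily use the same encoding, and editing the encoding every time the target site changes adds a lot of extra work and is not practical. Is there a way to solve this once and for all? This is where the chardet module comes in. Install it the same way (pip install chardet), add import chardet at the top of builtwith's __init__.py, and modify the source as follows: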

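        if html is None:
            html = response.read()
            # html = html.decode('utf-8')  # add by Johnahton 20170805
            # chardet.detect() returns a dict; its 'encoding' key holds the guessed encoding
            encode_type = chardet.detect(html)
            if encode_type['encoding'] == 'utf-8':
                html = html.decode('utf-8')
            else:
                # anything not detected as utf-8 is assumed to be gbk
                html = html.decode('gbk')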
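With chardet detecting the character encoding, the code no longer has to be edited for each site, at least for utf-8 and gbk pages. Note that the else branch still assumes gbk for every other encoding.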
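
A more general variant is to decode with whatever encoding chardet reports instead of comparing it against utf-8. The sketch below is one possible approach, not the change made in this post. decode_html and its fallbacks parameter are hypothetical names, and chardet.detect() is the only chardet call used. The import is wrapped so the sketch still runs when chardet is not installed.

```python
try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None  # the sketch still runs, using only the fallback list


def decode_html(raw, fallbacks=('utf-8', 'gbk')):
    """Decode raw bytes to str (decode_html is a hypothetical helper name)."""
    candidates = []
    if chardet is not None:
        guess = chardet.detect(raw)['encoding']  # may be None
        if guess:
            candidates.append(guess)
    candidates.extend(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one
    # Last resort: never raise, replace undecodable bytes instead.
    return raw.decode('utf-8', errors='replace')


print(decode_html(b'<html>hello</html>'))
```

In builtwith's __init__.py this would amount to replacing the if/else with a single call such as html = decode_html(response.read()), assuming the helper is defined in the same file. Detection on very short inputs can be unreliable, which is why the sketch keeps an explicit fallback list.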


Author: keepwonder
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit keepwonder when reposting!