Using builtwith in Python 3

DataAnalysis

Published: 2017-08-05

Using the builtwith module in Python 3 (the IDE is PyCharm, and the command line is PyCharm's built-in terminal).

Step 1: Install the builtwith module with pip install builtwith

    (/Users/jockie/install_programs/anaconda) jockie:~/programs/pycharm$ pip install builtwith
    Collecting builtwith
      Downloading builtwith-1.3.2.tar.gz
    Building wheels for collected packages: builtwith
      Running setup.py bdist_wheel for builtwith ... done
      Stored in directory: /Users/jockie/Library/Caches/pip/wheels/e4/cf/86/aa813feb4c79e680590a42766642b130358a01f1e26ecfe1d6
    Successfully built builtwith
    Installing collected packages: builtwith
    Successfully installed builtwith-1.3.2

Step 2: Test the builtwith module

    import builtwith
    info = builtwith.parse('http://www.xuanxiewu.com')
    print(info)

Running the code produces the following error:

    /Users/jockie/install_programs/anaconda/bin/python.app /Users/jockie/programs/pycharm/python_spider/chp01_01.py
    Traceback (most recent call last):
      File "/Users/jockie/programs/pycharm/python_spider/chp01_01.py", line 8, in <module>
        import builtwith
      File "/Users/jockie/install_programs/anaconda/lib/python3.6/site-packages/builtwith/__init__.py", line 42
        except Exception , e:
                         ^
    SyntaxError: invalid syntax

    Process finished with exit code 1

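The traceback shows a syntax error. Why? This version of builtwith was written for Python 2.x, so its source needs a few syntax changes before it will run on Python 3:
1. The Python 2 form 'except Exception , e' is not supported; change it to 'except Exception as e'.
2. The Python 2 print statement must become the print() function.
3. builtwith uses the urllib2 module, which only exists in Python 2; Python 3 uses urllib instead. Every use of urllib2 in the __init__.py source therefore has to be rewritten for urllib. First, replace import urllib2 with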

    import urllib.request
    import urllib.error

Then replace the urllib2 calls:

    request = urllib.request.Request(url, None, {'User-Agent': user_agent})
    # request = urllib2.Request(url, None, {'User-Agent': user_agent})
    response = urllib.request.urlopen(request)
    # response = urllib2.urlopen(request)
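The three changes above can be seen together in a minimal sketch. This is not builtwith's actual source: the function name fetch and the default user agent string are hypothetical, chosen only for illustration.

```python
import urllib.error
import urllib.request


def fetch(url, user_agent='builtwith-sketch'):
    """Return the raw response body as bytes, or None if the request fails.

    `fetch` and the default `user_agent` are illustrative names, not part
    of builtwith.
    """
    try:
        # Python 3: urllib.request replaces the Python 2 urllib2 module.
        request = urllib.request.Request(url, None, {'User-Agent': user_agent})
        response = urllib.request.urlopen(request)
        return response.read()
    except Exception as e:  # Python 2 spelled this `except Exception, e:`
        print('Error:', e)  # print is a function in Python 3
        return None
```

Calling fetch with a malformed URL such as 'not-a-url' prints the error and returns None instead of raising, which mirrors the catch-and-report style of the patched module.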

Run the code again; it now reports the following error:

    /Users/jockie/install_programs/anaconda/bin/python.app /Users/jockie/programs/pycharm/python_spider/chp01_01.py
    Traceback (most recent call last):
      File "/Users/jockie/programs/pycharm/python_spider/chp01_01.py", line 10, in <module>
        info = builtwith.parse('http://www.baidu.com')
      File "/Users/jockie/install_programs/anaconda/lib/python3.6/site-packages/builtwith/__init__.py", line 69, in builtwith
        if contains(html, snippet):
      File "/Users/jockie/install_programs/anaconda/lib/python3.6/site-packages/builtwith/__init__.py", line 111, in contains
        return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)
    TypeError: cannot use a string pattern on a bytes-like object

    Process finished with exit code 1

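This time it is a type error. In Python 3, response.read() returns bytes rather than str, while the regular expressions in builtwith are str patterns, so the response has to be decoded first. Replace the following code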

    if html is None:  
        html = response.read()

with

    if html is None:
        html = response.read()
        html = html.decode('utf-8')
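The type error can be reproduced without any network access. The sketch below is only an illustration: the pattern, the sample bytes, and the helper name search_html are made up, not taken from builtwith.

```python
import re

# A str pattern, like the ones builtwith compiles from its rule set.
PATTERN = re.compile('jquery', flags=re.IGNORECASE)

# What response.read() returns in Python 3: bytes, not str.
RAW = b'<script src="/static/jQuery.min.js"></script>'


def search_html(html):
    """Return True/False for a match, or None if the types are incompatible."""
    try:
        return PATTERN.search(html) is not None
    except TypeError as e:
        # Raised when a str pattern is applied to a bytes-like object.
        print('Error:', e)
        return None


print(search_html(RAW))                  # bytes input: TypeError is caught
print(search_html(RAW.decode('utf-8')))  # decoded input: the pattern matches
```

Decoding once, right after reading the response, keeps every later regex call working on str.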

Run the code again and the result is correct:

    /Users/jockie/install_programs/anaconda/bin/python.app /Users/jockie/programs/pycharm/python_spider/chp01_01.py
    {'font-scripts': ['Font Awesome', 'Google Font API'], 'web-frameworks': ['Twitter Bootstrap'], 'javascript-frameworks': ['jQuery']}

    Process finished with exit code 0

However, the decoding above hard-codes utf-8. What if a site does not use utf-8? As a test, take www.163.com, which uses gbk. Running the code against it fails again:

    /Users/jockie/install_programs/anaconda/bin/python.app /Users/jockie/programs/pycharm/python_spider/chp01_01.py
    Error: 'utf-8' codec can't decode byte 0xcd in position 565: invalid continuation byte
    Traceback (most recent call last):
      File "/Users/jockie/programs/pycharm/python_spider/chp01_01.py", line 10, in <module>
        info = builtwith.parse('http://www.163.com')
      File "/Users/jockie/install_programs/anaconda/lib/python3.6/site-packages/builtwith/__init__.py", line 69, in builtwith
        if contains(html, snippet):
      File "/Users/jockie/install_programs/anaconda/lib/python3.6/site-packages/builtwith/__init__.py", line 111, in contains
        return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)
    TypeError: cannot use a string pattern on a bytes-like object

    Process finished with exit code 1

Changing the encoding to gbk gives the correct result:

    /Users/jockie/install_programs/anaconda/bin/python.app /Users/jockie/programs/pycharm/python_spider/chp01_01.py
    {'web-servers': ['Nginx']}

    Process finished with exit code 0
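The same failure and fix can be shown offline with a short gbk-encoded string. This is a minimal sketch: the sample text and the helper name try_decode are illustrative assumptions.

```python
def try_decode(raw, encoding):
    """Decode raw bytes, returning None (and printing the error) on failure."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as e:
        print('Error:', e)
        return None


# Bytes as a gbk-encoded page would deliver them.
raw = '中文'.encode('gbk')

print(try_decode(raw, 'utf-8'))  # fails: these bytes are not valid utf-8
print(try_decode(raw, 'gbk'))    # succeeds with the matching codec
```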

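This raises a problem: different sites do not necessarily use the same encoding, and editing the encoding every time the target site changes is extra work that does not scale. Is there a way to avoid that? This is where the chardet module comes in. Install it the same way, with pip install chardet, add import chardet to builtwith's __init__.py, and modify the source as follows: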

        if html is None:
            html = response.read()
            # html = html.decode('utf-8')  # added by Johnahton 20170805
            encode_type = chardet.detect(html)
            # Decode with whatever chardet detected; its 'encoding' can be
            # None when detection fails, so fall back to utf-8 in that case.
            html = html.decode(encode_type['encoding'] or 'utf-8', errors='replace')

With chardet detecting the character encoding, the code no longer has to be edited for each site.
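If installing chardet is not an option, a standard-library-only fallback is possible. The sketch below is a different technique from chardet's statistical detection: it first trusts the charset declared in the HTTP Content-Type header, then tries a fixed list of candidate encodings. The function name decode_html and the candidate list are assumptions for illustration.

```python
from email.message import Message


def decode_html(raw, headers=None, candidates=('utf-8', 'gbk')):
    """Decode raw bytes using the declared charset first, then each candidate.

    `headers` is expected to behave like urllib's response.headers, which
    is an email.message.Message subclass.
    """
    declared = headers.get_content_charset() if headers is not None else None
    for encoding in ([declared] if declared else []) + list(candidates):
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding; try the next one
    # Last resort: never raise, replace undecodable bytes instead.
    return raw.decode('utf-8', errors='replace')


# Simulate a response whose header declares gbk.
headers = Message()
headers['Content-Type'] = 'text/html; charset=GBK'
print(decode_html('中文'.encode('gbk'), headers))
```

Inside the patched module this would be called as decode_html(response.read(), response.headers). The order of candidates matters: utf-8 goes first because it rejects most non-utf-8 input, while gbk accepts many byte sequences it was never meant for.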

keepwonder

http://xuanxiewu.com/2017/08/05/use-builtwith-in-python3/