lualu: Beautiful Soup with Chinese.

September 18, 2006

Beautiful Soup with Chinese.

Tags: Text

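The basic flow by which BS (Beautiful Soup) turns a document into Unicode is:

First, it checks whether an encoding was passed in BeautifulSoup(html, fromEncoding=encoding).
If no encoding was given, it tries to detect one, using chardet (if installed) or the built-in UnicodeDammit.
Finally, it converts the document to Unicode and parses it.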
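The order of those steps can be sketched with the standard library alone. This is a simplified illustration, not Beautiful Soup's actual code: the helper name to_unicode and the fallbacks list are made up for this example, and chardet is only used if it happens to be installed.

```python
def to_unicode(raw, from_encoding=None, fallbacks=("utf-8", "gbk", "latin-1")):
    """Decode raw bytes to text, following the order described above.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    Returns a (text, encoding_used) tuple.
    """
    candidates = []
    if from_encoding:
        # 1. An encoding supplied by the caller is tried first.
        candidates.append(from_encoding)
    else:
        # 2. Otherwise ask chardet for a guess, if it is installed.
        try:
            import chardet
            guess = chardet.detect(raw).get("encoding")
            if guess:
                candidates.append(guess)
        except ImportError:
            pass
    # 3. Last-resort guesses (an assumption of this sketch).
    candidates.extend(fallbacks)

    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding, try the next one
    raise ValueError("could not decode input with any candidate encoding")
```

Calling to_unicode(raw, from_encoding="gbk") on GBK bytes decodes them directly. Passing a wrong encoding makes the helper fall through to the fallbacks rather than fail outright.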

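Output follows an even simpler flow:

The output encoding is specified through __str__(coding) or renderContents(coding), which boils down to encode(coding).
If coding is not specified, 'utf-8' is used as the default output encoding.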
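In plain Python terms, the output step amounts to something like the following. The render function is a hypothetical stand-in for __str__ / renderContents, written only to show the default-to-UTF-8 behavior described above.

```python
def render(text, encoding="utf-8"):
    """Encode Unicode text for output; UTF-8 when no encoding is given.

    Hypothetical stand-in for illustration, not a Beautiful Soup API.
    """
    return text.encode(encoding)


text = "\u4e2d\u6587"            # the two characters of "中文"
utf8_bytes = render(text)        # default output encoding
gbk_bytes = render(text, "gbk")  # explicitly requested encoding
```

The same text yields different byte strings depending on the requested encoding, which is why the output encoding has to match whatever the terminal or file expects.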

So, to process a "gbk" document, use it like this:

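from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3 on Python 2

soup = BeautifulSoup(html_str, fromEncoding="gbk")  # html_str holds GBK-encoded bytes
print soup.__str__('gbk')  # re-encode the output as GBK
# or
print soup.__str__().decode('utf-8')  # default UTF-8 output, decoded to a unicode object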

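4 comments:

动态感觉静观其变 said...: Great, I finally get to read a post by sa; 10:24
Anonymous said...: hehe, this is xlp223. Good to see a post by sa.; 10:26
putt said...: -_-||
What do you mean?
This post is bs's, right....; 13:14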
moments 说...: beautifulsoup default settings disable chardet. if you are dealing with chinese webpages/forums, install chardet and enable it in beautifulsoup source code.; 08:31
