html - How to remove nonAscii characters in python

Question

Welcome To Ask or Share your Answers For Others

html - How to remove nonAscii characters in python

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

html - How to remove nonAscii characters in python

This is my code:

#!C:/Python27/python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import urllib2
import sys
import urlparse
import io

url = "http://www.dlib.org/dlib/november14/beel/11beel.html"
#url = "http://eqa.unibo.it/article/view/4554"
#r = requests.get(url)
html = urllib2.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
#soup = BeautifulSoup(r.text,'lxml')

if url.find("http://www.dlib.org") != -1:
    div = soup.find('td', valign='top')
else:
    div = soup.find('div',id='content')

f = open('path/file_name.html', 'w')
f.write(str(div))
f.close()

Scraping those webpages i've found some nonAScii characters into the html file written from this script that i need to remove or solve into a readable chars. Any advice? Thanks

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T21:33:55+0000

Try to normalize the string and then ASCII encode it ignoring errors.

# -*- coding: utf-8 -*-
from unicodedata import normalize

string = 'ú??§'

if isinstance(string, str):
    string = string.decode('utf-8')

print normalize('NFKD', string).encode('ASCII', 'ignore')
>>> uao

Categories

html - How to remove nonAscii characters in python

html - How to remove nonAscii characters in python

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags