Getting international characters from a web page?
I want to scrape some information off a football (soccer) web page using simple Python regexps. The problem is that player names, such as the first chap, ÄÄRITALO, come out as &#196;&#196;RITALO!
That is, the HTML uses escaped markup (numeric character references) for the special characters, such as &#196; for Ä.
Is there a simple way of reading the html into the correct python string? If it was XML/XHTML it would be easy, the parser would do it.
I would recommend BeautifulSoup for HTML scraping. You also need to tell it to convert HTML entities to the corresponding Unicode characters, like so:
>>> from BeautifulSoup import BeautifulSoup
>>> html = "<html>&#196;&#196;RITALO!</html>"
>>> soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
>>> print soup.contents[0].string
ÄÄRITALO!
(It would be nice if the standard codecs module included a codec for this, such that you could do
"some_string".decode('html_entities'), but unfortunately it doesn't!)
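For readers on Python 3.4 or later (an assumption, since the session above is Python 2), the standard library's html.unescape function does this kind of decoding without a third-party parser. A minimal sketch, using the player name from the question as sample input:

```python
import html

# Sample text as it appears in the page source, using numeric character references.
raw = "&#196;&#196;RITALO!"

# html.unescape converts decimal, hexadecimal, and named references to characters.
decoded = html.unescape(raw)
print(decoded)
```

This only decodes the entities. It does not parse the HTML, so it fits the regexp-based approach in the question, where the text has already been extracted.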
Another solution: Python developer Fredrik Lundh (author of elementtree, among other things) has a function for unescaping HTML entities on his website, which works with decimal, hex and named entities (BeautifulSoup will not work with the hex ones).
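To illustrate the idea of handling all three entity forms with a regexp, here is a rough sketch in Python 3. It is not Lundh's code; the function name unescape_entities and the pattern are hypothetical choices made for this example, and unknown or malformed entities are simply left as they are.

```python
import re
from html.entities import name2codepoint

# Matches &#196; (decimal), &#xC4; (hex) and &Auml; (named) style references.
_ENTITY_RE = re.compile(r"&(#[xX]?[0-9a-fA-F]+|\w+);")

def unescape_entities(text):
    """Replace HTML character references in text with the characters they name."""
    def replace(match):
        body = match.group(1)
        try:
            if body[:2].lower() == "#x":
                return chr(int(body[2:], 16))   # hexadecimal reference
            if body.startswith("#"):
                return chr(int(body[1:], 10))   # decimal reference
            return chr(name2codepoint[body])    # named entity
        except (KeyError, ValueError, OverflowError):
            return match.group(0)               # leave unknown input untouched
    return _ENTITY_RE.sub(replace, text)

print(unescape_entities("&#196;&#xC4;&Auml;RITALO!"))
```

The same function can be applied to each string that a regexp pulls out of the page, so the rest of the scraping code does not need to change.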