I want to scrape some information off a football (soccer) web page using simple Python regexps. The problem is that players such as the first chap, ÄÄRITALO, come out as &#196;&#196;RITALO!
That is, the HTML uses escaped markup for special characters, such as &#196;

Is there a simple way of reading the HTML into the correct Python string? If it were XML/XHTML it would be easy: the parser would do it.


I would recommend BeautifulSoup for HTML scraping. You also need to tell it to convert HTML entities to the corresponding Unicode characters, like so:

    >>> from BeautifulSoup import BeautifulSoup
    >>> html = "&#196;&#196;RITALO!"
    >>> soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
    >>> print soup.contents[0].string
    ÄÄRITALO!

(It would be nice if the standard codecs module included a codec for this, such that you could do "some_string".decode('html_entities'), but unfortunately it doesn’t!)
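
For anyone reading this on a newer interpreter: as far as I know, Python 3.4 and later ship html.unescape in the standard library, which covers this case without a third-party package. A minimal sketch, assuming Python 3 (the variable names are just for illustration):

```python
import html

# The escaped text as it appears in the page source.
raw = "&#196;&#196;RITALO!"

# html.unescape handles decimal, hex and named character references.
decoded = html.unescape(raw)
mixed = html.unescape("&#196;&#xC4;&Auml;")

print(decoded)
print(mixed)
```

This is not a codec, so it does not give you the .decode('html_entities') spelling, but it does the same job on an already-decoded string.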

Another solution:
Python developer Fredrik Lundh (author of elementtree, among other things) has published a function for this on his website; it works with decimal, hex and named entities (BeautifulSoup will not work with the hex ones).
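
To give an idea of how a regex-based approach like that can work, here is a rough sketch. It is not Lundh's code: the function name unescape_entities and the pattern are my own, and it assumes Python 3, where html.entities.name2codepoint supplies the named-entity table.

```python
import re
from html.entities import name2codepoint

# Matches &#xHH; (hex), &#NNN; (decimal) and &name; (named) references.
_ENTITY_RE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|\w+);")


def unescape_entities(text):
    """Replace HTML character references in text with the characters they name.

    Hypothetical helper for illustration; unknown or invalid references
    are left exactly as they were.
    """
    def replace(match):
        body = match.group(1)
        try:
            if body[:2] in ("#x", "#X"):
                return chr(int(body[2:], 16))    # hex reference
            if body.startswith("#"):
                return chr(int(body[1:]))        # decimal reference
            return chr(name2codepoint[body])     # named reference
        except (KeyError, ValueError, OverflowError):
            return match.group(0)                # leave unrecognised text alone

    return _ENTITY_RE.sub(replace, text)


player = unescape_entities("&#196;&#xC4;&Auml;RITALO!")
untouched = unescape_entities("&nosuch; &amp; done")
print(player)
print(untouched)
```

Leaving unrecognised references untouched is a deliberate choice here, so a stray ampersand in scraped text does not raise an error halfway through a page.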

