The blog of Tobin

Tobin's nerd blog on .NET, Software, Tech and Nice Shiny Gadgets.

Friday, September 15, 2006

Parsing HTML and XHTML in .NET

Bear™ says:
know of any good .NET xhtml parsers that are free?

Bear™ says:
something that will correct tags on the fly

Bear™ says:
like;
myString = HTMLParser.Parse ( mySource )

Tobelerone says:
SgmlReader - I use it a lot and it's good.
http://www.gotdotnet.com/Community/UserSamples/Details.aspx?SampleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC

Html Agility Pack - is good too
http://sharptoolbox.com/tools/html-agility-pack

Devcomponents HtmlDocument - very good (I use it too) but costs $99
http://www.devcomponents.com/htmldoc/download.html

Chilkat HtmlToXml - haven't used but buying soon!
http://www.chilkatsoft.com/HtmlToXmlDotNet.asp

Bear™ says:
wow

Bear™ says:
thanks!

Using components that convert HTML to XHTML/XML is a great way to go if you need to mine information from web documents.

The best part of these converters is that they take badly written HTML (broken tags and so on) and fix it up as best they can, so you get a well-formed XML document that you can work with.
Once the result is loaded into an XmlDocument, you can run XPath queries against it and do lovely things such as:


//find all link tags on a web page
XmlNodeList linkNodes = xmlDoc.SelectNodes("//a[@href]");


//find all heading tags (h1 to h4) on a page
XmlNodeList headingNodes = xmlDoc.SelectNodes("//h1 | //h2 | //h3 | //h4");
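
If you want to try these queries without installing any of the converters first, here is a minimal sketch that uses only System.Xml. The xhtml string is a made-up stand-in for whatever a converter would hand back (it is not output from any of the tools above), and the class name XPathDemo is just for illustration. It assumes the converted document has no default XML namespace; if yours does, the un-prefixed element names in the XPath will not match and you would need an XmlNamespaceManager.

```csharp
using System;
using System.Xml;

class XPathDemo
{
    static void Main()
    {
        // Assumption: stand-in for converter output, a small well-formed document.
        string xhtml =
            "<html><body>" +
            "<h1>Title</h1>" +
            "<p><a href=\"http://example.com/\">A link</a> and " +
            "<a name=\"anchor\">a named anchor</a></p>" +
            "<h2>Subtitle</h2>" +
            "</body></html>";

        XmlDocument xmlDoc = new XmlDocument();
        xmlDoc.LoadXml(xhtml);

        // Only <a> elements that carry an href attribute are selected,
        // so the named anchor is skipped.
        XmlNodeList linkNodes = xmlDoc.SelectNodes("//a[@href]");
        foreach (XmlNode link in linkNodes)
        {
            Console.WriteLine(link.Attributes["href"].Value + " -> " + link.InnerText);
        }

        // The XPath union operator (|) merges several node sets into one.
        XmlNodeList headingNodes = xmlDoc.SelectNodes("//h1 | //h2 | //h3 | //h4");
        foreach (XmlNode heading in headingNodes)
        {
            Console.WriteLine(heading.Name + ": " + heading.InnerText);
        }
    }
}
```

In real use you would swap the hard-coded string for the XML produced by your converter of choice; the two SelectNodes calls stay the same.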


Good eh!
