[OGo-Developer] parsing HTML

Marcus Müller developer@opengroupware.org
Wed, 14 Feb 2007 18:43:54 +0100



On 14.02.2007, at 18:18, Wolfgang Sourdeau wrote:

> On 2007-02-14 11:51:12 -0500 Wolfgang Sourdeau  
> <WSourdeau@Inverse.CA> wrote:
>
>> Hi,
>> Is SaxObjC able to parse HTML or is it limited to XML? Should I  
>> implement my own parser for it?
>
> I guess this is the answer to my own question:
>
>       parser = [[SaxXMLReaderFactory standardXMLReaderFactory]
>                  createXMLReaderWithName: @"libxmlSAXDriver"];

It's better to use

-createXMLReaderForMimeType:@"text/html"

although the result is the same in this case (unless you provide an
alternative driver that can also parse HTML). Note also that libxml is
particularly good at reading b0rked HTML and producing a reasonable
representation of it.
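
For illustration, here is a minimal sketch of how the MIME-type lookup might be used end to end. The handler class HTMLElementLogger is hypothetical, and the umbrella header and handler method signatures are written from memory of SOPE's SaxObjC, so check them against your installed headers before relying on them.

```objc
#import <Foundation/Foundation.h>
#import <SaxObjC/SaxObjC.h>

/* Hypothetical handler: logs every element the HTML driver reports. */
@interface HTMLElementLogger : SaxDefaultHandler
@end

@implementation HTMLElementLogger

- (void)startElement:(NSString *)_localName
  namespace:(NSString *)_ns
  rawName:(NSString *)_rawName
  attributes:(id<SaxAttributes>)_attributes
{
  NSLog(@"start element: %@", _localName);
}

@end

int main(int argc, char **argv) {
  NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
  id                parser;
  HTMLElementLogger *handler;
  NSString          *html;

  /* Ask for a reader by MIME type rather than by driver name. */
  parser = [[SaxXMLReaderFactory standardXMLReaderFactory]
             createXMLReaderForMimeType:@"text/html"];
  if (parser == nil) {
    NSLog(@"no SAX driver registered for text/html");
    [pool release];
    return 1;
  }

  handler = [[HTMLElementLogger alloc] init];
  [parser setContentHandler:handler];
  [parser setErrorHandler:handler];

  /* Deliberately sloppy markup: <p> and <b> are never closed. */
  html = @"<html><body><p>Hello <b>world</body></html>";
  [parser parseFromSource:[html dataUsingEncoding:NSUTF8StringEncoding]];

  [handler release];
  [pool release];
  return 0;
}
```

Requesting the reader by MIME type keeps the calling code independent of which driver is installed, so a different HTML-capable driver could be swapped in later without touching this code.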


Cheers,

   Marcus

-- 
Marcus Mueller  .  .  .  crack-admin/coder ;-)
Mulle kybernetiK  .  http://www.mulle-kybernetik.com
Current projects: http://www.mulle-kybernetik.com/znek/


