CocoaDev

Hi all,

I am now an experienced Cocoa programmer, but one thing has been bugging me for quite a while now. I moved from VB and VC++, where there was a function that let me open a file, read through it line by line, and do something with each line (usually save it into an array). I was wondering, is there any equivalent for Cocoa? I really need it for parsing an HTML file.

If anyone has any better ideas for parsing the file, feel free to enlighten me as I am all ears.

Thanks for your time.


I’ve been programming in Cocoa for about two years (C for about 5), but I’m not much of an HTML expert. Objective-C isn’t the easiest language to parse text in. “sed” and “awk” are great command line text tools if you know them. If you know Perl, I’ve heard that people have built Cocoa objects that wrap it.

There are a bunch of really knowledgeable Cocoa programmers that frequent this site (I’m not one of them), so you have a good chance of getting some useful info.

If you have to parse text with Objective C here are some useful things I’ve come across.

–zootbobbalu


Thanks for that tip. After a lot of mucking around, it now works like a charm (I had to add code). Again, thanks. :-)


Could anyone post some source code? I’m trying to figure out how to work around the HTML tags. Thanks.


Take a look at the Boost Spirit libraries. I use them for my parsing and they work great. They’re C++ libraries that let you inline EBNF grammar descriptions straight into your C++ code; the parser is then assembled at compile time (dynamic options are also available). You can just use Objective-C++ and have these libraries do the grunt work of parsing. They work just as well as lex and yacc, and they save you the time of compiling separate programs and so forth. Here’s a link: http://www.boost.org/libs/spirit/index.html

Code might look something like this:

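    void handletag(const char* first, const char* last)
    {
        std::string str(first, last);
        std::cout << "This is what's inside the tag: " << str;
    }

    // phrase_scanner_t because parse() below is given a skip parser (space_p).
    // *anychar_p on its own is greedy and would swallow the closing '>',
    // so the '>' is excluded from the tag body.
    rule<phrase_scanner_t> html_tag =
        ch_p('<') >> (*(anychar_p - ch_p('>')))[&handletag] >> ch_p('>');

    parse("", html_tag, space_p);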


I read the HTML I wanted to parse into an NSString, then used AGRegex to split the input into chunks - tag-oriented rather than line-oriented. (Don’t assume an HTML file is neatly split into lines.) The RE I used was @”[<>] [<>]”. This gave me an NSArray in which each element was one of (a) an opening tag, possibly with attributes; (b) element content (i.e. CDATA); or (c) a closing tag.

RickInnis
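The tag-oriented chunking described above can be sketched in C++ with `std::regex` standing in for AGRegex. The pattern used here is an assumption chosen for illustration, not necessarily the expression from the post, and `split_into_chunks` is a made-up name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split markup into tags ("<...>") and the text between them.
std::vector<std::string> split_into_chunks(const std::string& html)
{
    // Either a complete tag, or a run of characters containing no angle brackets.
    static const std::regex chunk_re("<[^<>]*>|[^<>]+");

    std::vector<std::string> chunks;
    for (std::sregex_iterator it(html.begin(), html.end(), chunk_re), end; it != end; ++it)
        chunks.push_back(it->str());
    return chunks;
}
```

Each chunk then starts with `</` (closing tag), with `<` (opening tag), or with anything else (text). This kind of split gets confused by comments, scripts, and attribute values that contain `>`, so it suits reasonably clean input only.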


I’m surprised nobody has mentioned NSScanner yet. It lets you walk over the characters in a string until you hit a particular string, or a character out of an NSCharacterSet, and lots more.

If all you need to do is display HTML, you may want to check out NSAttributedString (which has a method to load HTML-escaped and -styled text) or Apple’s brand-spankin’ new WebKit. The latter provides a “browser view” which you give a remote or local URL and it loads it, including tables, plugins etc. – It’s the same engine Safari uses, and it actually requires Safari 1.0 to be installed.

UliKusterer


with regard to NSScanner, mentioned above: NSScanner is much slower with character sets than with strings. see my NSScannerTimeTrialSourceCode. –boredzo


I’m not convinced that regular expressions are really robust enough to handle HTML in all its mess… Check out the el-kabong HTML parser (ekhtml.sourceforge.net) and the usage in the cocoa app Blapp - blapp.sourceforge.net - the code in question is here: http://cvs.sourceforge.net/viewcvs.py/blapp/blapp/LinkWindowController.m?rev=1.23&view=auto


Having written both a Perl Compatible Regular Expression Library and an HTML 4.01 compliant web browser, I can only second the latter opinion :)

HTML is recursive in structure, regular expressions are not. There exist extensions which allow them to perform recursive matching, but this excludes the possibility of captures, which makes them good only for verification, not “parsing”.
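To make the point about recursion concrete: even just checking that tags are properly nested needs a stack (or recursion), which a plain regular expression does not have. A rough sketch, assuming simplified input with no comments and no `>` inside attribute values; `is_properly_nested` is a made-up name.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: true if every <x> is closed by a matching </x> in the right order.
bool is_properly_nested(const std::string& markup)
{
    std::vector<std::string> open_tags; // the stack a plain regular expression lacks
    std::string::size_type pos = 0;

    while ((pos = markup.find('<', pos)) != std::string::npos) {
        std::string::size_type end = markup.find('>', pos);
        if (end == std::string::npos)
            return false; // unterminated tag

        std::string body = markup.substr(pos + 1, end - pos - 1);
        pos = end + 1;

        if (!body.empty() && body.back() == '/')
            continue; // self-closing tag such as <br/>

        bool is_close = !body.empty() && body.front() == '/';
        std::string name = body.substr(is_close ? 1 : 0);
        name = name.substr(0, name.find_first_of(" \t\r\n")); // strip attributes

        if (!is_close) {
            open_tags.push_back(name);
        } else {
            if (open_tags.empty() || open_tags.back() != name)
                return false; // close tag does not match the innermost open tag
            open_tags.pop_back();
        }
    }
    return open_tags.empty();
}
```

The depth of `open_tags` is unbounded, which is exactly what a fixed pattern cannot track.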

Using NSScanner or similar does not strike me as a benefit either – if we were to parse valid XHTML we could do with something like this:

    #include <algorithm>
    #include <iterator>
    #include <string>

    template <typename _Iter>
    struct TagInfo
    {
        bool has_children;
        bool is_close_tag;
        _Iter begin_of_tag;
        _Iter end_of_tag;
        std::string tag_name;
    };

    template <typename _Iter>
    TagInfo<_Iter> get_tag_info (_Iter first, _Iter last)
    {
        TagInfo<_Iter> res;
        res.begin_of_tag = first;
        ++first; // skip '<'
        res.is_close_tag = (first != last && *first == '/');
        if(res.is_close_tag)
            ++first;

        // the tag name ends at whitespace, '/' or '>'
        _Iter begin_of_name = first;
        static const char ws[] = { ' ', '\t', '\n', '\r', '/', '>' };
        first = std::find_first_of(first, last, std::begin(ws), std::end(ws));
        res.tag_name = std::string(begin_of_name, first);

        // parse arguments …

        // a '/' before the closing '>' marks an empty element such as <br/>
        static const char endChar[] = { '/', '>' };
        first = std::find_first_of(first, last, std::begin(endChar), std::end(endChar));
        res.has_children = !(first != last && *first == '/');

        first = std::find(first, last, '>');
        if(first != last)
            ++first;
        res.end_of_tag = first;
        return res;
    }

    // Tag must be fully defined before this function.
    template <typename _Iter>
    _Iter parse_xhtml_buffer (_Iter first, _Iter last, Tag& parent)
    {
        while(first != last)
        {
            _Iter begin_of_text = first;
            first = std::find(first, last, '<');

            if(begin_of_text != first)
                parent.add_text_node(begin_of_text, first);

            if(first == last) // only trailing text was left, no more tags
                break;

            TagInfo<_Iter> const& info = get_tag_info(first, last);
            first = info.end_of_tag;

            if(info.is_close_tag)
                break;

            Tag& newTag = parent.add_tag_with_info(info);
            if(info.has_children)
                first = parse_xhtml_buffer(first, last, newTag);
        }
        return first;
    }

I left out the actual Tag structure, it could be:

    #include <string>
    #include <vector>

    struct Tag
    {
        std::vector<Tag> children;
        std::string tag_name;

        Tag (std::string name) : tag_name(name) { }

        template <typename _Iter>
        Tag& add_tag_with_info (TagInfo<_Iter> const& info)
        {
            children.push_back(Tag(info.tag_name));
            return children.back();
        }

        template <typename _Iter>
        void add_text_node (_Iter first, _Iter last)
        {
            // well, we should have a tag base class and a text subclass... doh!
            children.push_back(Tag(std::string(first, last)));
        }
    };

The above can be modified to also parse properly nested HTML. The change is that, rather than looking for a trailing / in the tag, a table lookup should be used to check whether the tag can have children.
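A sketch of such a table lookup, assuming the set of elements that HTML 4.01 declares as EMPTY. The function name `can_have_children` is made up for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>

// Hypothetical helper: false for the elements HTML 4.01 declares EMPTY.
bool can_have_children(std::string tag_name)
{
    static const std::set<std::string> empty_elements = {
        "area", "base", "basefont", "br", "col", "frame", "hr",
        "img", "input", "isindex", "link", "meta", "param"
    };

    // HTML tag names are case-insensitive, so compare in lower case.
    std::transform(tag_name.begin(), tag_name.end(), tag_name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    return empty_elements.count(tag_name) == 0;
}
```

In get_tag_info, the result of this lookup on res.tag_name would then take the place of the trailing-slash test when setting res.has_children.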

Furthermore, rather than blindly returning when it sees a close tag, it should check whether the close tag matches the current parent tag. If not, it should still return, but set a state indicating a pending close tag, so that the parent tag does the same check, and so on, since e.g. “bla</b>” is actually valid HTML (here </b> closes two tags).
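With an explicit stack of open tag names instead of recursion, the "pending close tag" rule becomes a short loop: search outwards for the matching open tag and close everything inside it. This is only a sketch; `close_tag` is a made-up name, and silently ignoring a close tag that matches nothing is one possible policy, not the only one.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: apply a close tag to the stack of currently open tags.
// Returns how many elements the close tag ended (0 if it matched nothing).
std::size_t close_tag(std::vector<std::string>& open_tags, const std::string& name)
{
    // Search from the innermost open element outwards.
    for (std::size_t i = open_tags.size(); i > 0; --i) {
        if (open_tags[i - 1] == name) {
            std::size_t closed = open_tags.size() - (i - 1);
            // Pop the match and everything that was opened inside it.
            open_tags.erase(open_tags.begin() + (i - 1), open_tags.end());
            return closed;
        }
    }
    return 0; // stray close tag: leave the stack untouched
}
```

A return value greater than 1 corresponds to the case where one close tag ends several elements at once.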

A final change is to keep open counts for all tags, because some tags will close others, e.g. <p> will close any previously open <p> tags. Unfortunately these open counts need to be pushed and popped on some tags (like entering a <table>), and the de facto rules for these things are very weird. I think it stems from the first browsers not using recursion to parse HTML, and treating many tags as toggles rather than openers/closers.
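One way to picture the open counts, including the push and pop around something like a <table>, is a stack of per-scope counters. The class and method names below are made up for illustration, and which tags start a new scope is left to the caller.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical bookkeeping for "how many <x> are currently open".
class OpenCounts
{
public:
    OpenCounts() : scopes_(1) { }

    void opened(const std::string& name) { ++scopes_.back()[name]; }

    void closed(const std::string& name)
    {
        auto it = scopes_.back().find(name);
        if (it != scopes_.back().end() && it->second > 0)
            --it->second;
    }

    // True if an element of this name is open in the current scope,
    // e.g. a new <p> should first close the previous one.
    bool is_open(const std::string& name) const
    {
        auto it = scopes_.back().find(name);
        return it != scopes_.back().end() && it->second > 0;
    }

    // Entering e.g. a <table> starts with fresh counts; leaving it restores the old ones.
    void push_scope() { scopes_.push_back(std::map<std::string, int>()); }
    void pop_scope() { if (scopes_.size() > 1) scopes_.pop_back(); }

private:
    std::vector<std::map<std::string, int>> scopes_;
};
```

On seeing <p>, a parser would ask is_open("p") and, if true, close the earlier paragraph before calling opened("p").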

That explains why OpenGL code often reminds me of a markup language…


I honestly don’t see why anyone would want to go through all that work when Spirit has provided a very EFFICIENT and LIGHTWEIGHT solution in their libraries (described above). I did basically the same thing in 4 lines as the above. Spirit custom tailors a parser at compile time, and even provides options for error handling. The best part is, it’s almost straight EBNF, so if you can find a good EBNF description of HTML online, you can translate it to the inline C++ Spirit stuff and not even have to worry about writing it correctly.

Because HTML cannot be described using EBNF.

Did you not see the example above? Extend that a little and you’re set.

But the example above only parses strict XHTML. HTML has a dozen rules which cannot be expressed in EBNF; some are also mentioned above.

I honestly don’t see anything that wouldn’t be doable with Spirit, maybe not with EBNF(?), but certainly with the Spirit libraries, care to give an example?


Here is one: