.TH HTML 2 .SH NAME parsehtml, printitems, validitems, freeitems, freedocinfo, dimenkind, dimenspec, targetid, targetname, fromStr, toStr \- HTML parser .SH SYNOPSIS .nf .PP .ft L #include #include #include #include .ft P .PP .ta \w'\fLToken* 'u .B Item* parsehtml(uchar* data, int datalen, Rune* src, int mtype, .B int chset, Docinfo** pdi) .PP .B void printitems(Item* items, char* msg) .PP .B int validitems(Item* items) .PP .B void freeitems(Item* items) .PP .B void freedocinfo(Docinfo* d) .PP .B int dimenkind(Dimen d) .PP .B int dimenspec(Dimen d) .PP .B int targetid(Rune* s) .PP .B Rune* targetname(int targid) .PP .B uchar* fromStr(Rune* buf, int n, int chset) .PP .B Rune* toStr(uchar* buf, int n, int chset) .SH DESCRIPTION .PP This library implements a parser for HTML 4.0 documents. The parsed HTML is converted into an intermediate representation that describes how the formatted HTML should be laid out. .PP .I Parsehtml parses an entire HTML document contained in the buffer .I data and having length .IR datalen . The URL of the document should be passed in as .IR src . .I Mtype is the media type of the document, which should be either .B TextHtml or .BR TextPlain . The character set of the document is described in .IR chset , which can be one of .BR US_Ascii , .BR ISO_8859_1 , .B UTF_8 or .BR Unicode . The return value is a linked list of .B Item structures, described in detail below. As a side effect, .BI * pdi is set to point to a newly created .B Docinfo structure, containing information pertaining to the entire document. .PP The library expects two allocation routines to be provided by the caller, .B emalloc and .BR erealloc . These routines are analogous to the standard malloc and realloc routines, except that they should not return if the memory allocation fails. In addition, .B emalloc is required to zero the memory. 
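.PP
The two allocation routines described above might be supplied along the following lines. This is a minimal sketch in portable C rather than the library's own code: the parameter types and the choice to print a message and exit on failure are assumptions, since this page only requires that the routines not return on failure and that
.B emalloc
zero the memory.

```c
#include <stdio.h>
#include <stdlib.h>

/* Allocate n zeroed bytes. Never returns if the allocation fails. */
void*
emalloc(size_t n)
{
	void *p;

	/* calloc zeroes the memory; ask for at least one byte */
	p = calloc(1, n ? n : 1);
	if(p == NULL){
		fprintf(stderr, "emalloc: out of memory\n");
		exit(1);
	}
	return p;
}

/* Resize p to n bytes. Never returns if the allocation fails.
 * Bytes beyond the old size are not zeroed. */
void*
erealloc(void *p, size_t n)
{
	p = realloc(p, n ? n : 1);
	if(p == NULL){
		fprintf(stderr, "erealloc: out of memory\n");
		exit(1);
	}
	return p;
}
```

A caller that prefers to recover from memory exhaustion would have to do so inside these routines, for example by releasing caches before retrying, because the library does not check their return values for nil.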
.PP For debugging purposes, .I printitems may be called to display the contents of an item list; individual items may be printed using the .B %I print verb, installed on the first call to .IR parsehtml . .I Validitems traverses the item list, checking that all of the pointers are valid. It returns .B 1 if everything is ok, and .B 0 if an error was found. Normally, one would not call these routines directly. Instead, one sets the global variable .I dbgbuild and the library calls them automatically. One can also set .IR warn , to cause the library to print a warning whenever it finds a problem with the input document, and .IR dbglex , to print debugging information in the lexer. .PP When an item list is finished with, it should be freed with .IR freeitems . Then, .I freedocinfo should be called on the pointer returned in .BI * pdi\f1. .PP .I Dimenkind and .I dimenspec are provided to interpret the .B Dimen type, as described in the section .IR "Dimension Specifications" . .PP Frame target names are mapped to integer ids via a global, permanent mapping. To find the value for a given name, call .IR targetid , which allocates a new id if the name hasn't been seen before. The name of a given, known id may be retrieved using .IR targetname . The library predefines .BR FTtop , .BR FTself , .B FTparent and .BR FTblank . .PP The library handles all text as Unicode strings (type .BR Rune* ). Character set conversion is provided by .I fromStr and .IR toStr . .I FromStr takes .I n Unicode characters from .I buf and converts them to the character set described by .IR chset . .I ToStr takes .I n bytes from .IR buf , interpreted as belonging to character set .IR chset , and converts them to a Unicode string. Both routines null-terminate the result, and use .B emalloc to allocate space for it.
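.PP
To illustrate what a conversion to a Unicode string amounts to in the simplest case, the sketch below converts ISO 8859-1 bytes to a null-terminated array of code points. It is not the library's implementation: the function name
.B latin1torunes
is hypothetical, the
.B Rune
typedef is a stand-in for the system's type, and it calls
.B calloc
directly where the library would use
.BR emalloc .

```c
#include <stdlib.h>

typedef unsigned int Rune;	/* assumption: stand-in for the system Rune type */

/* Hypothetical helper: convert n ISO 8859-1 bytes from buf into a
 * newly allocated, null-terminated Rune string. */
Rune*
latin1torunes(const unsigned char *buf, int n)
{
	Rune *r;
	int i;

	r = calloc((size_t)n + 1, sizeof *r);
	if(r == NULL)
		abort();	/* mirror the "never returns on failure" rule */
	for(i = 0; i < n; i++)
		r[i] = buf[i];	/* each Latin-1 byte value is its Unicode code point */
	r[n] = 0;
	return r;
}
```

The other character sets need real decoding; only ISO 8859-1 and US-ASCII map byte for byte onto Unicode code points.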
.SS Items The return value of .I parsehtml is a linked list of variant structures, with the generic portion described by the following definition: .PP .EX .ta 6n +\w'Genattr* 'u typedef struct Item Item; struct Item { Item* next; int width; int height; int ascent; int anchorid; int state; Genattr* genattr; int tag; }; .EE .PP The field .B next points to the successor in the linked list of items, while .BR width , .BR height , and .B ascent are intended for use by the caller as part of the layout process. .BR Anchorid , if non-zero, gives the integer id assigned by the parser to the anchor that this item is in (see section .IR Anchors ). .B State is a collection of flags and values described as follows: .PP .EX .ta 6n +\w'IFindentshift = 'u enum { IFbrk = 0x80000000, IFbrksp = 0x40000000, IFnobrk = 0x20000000, IFcleft = 0x10000000, IFcright = 0x08000000, IFwrap = 0x04000000, IFhang = 0x02000000, IFrjust = 0x01000000, IFcjust = 0x00800000, IFsmap = 0x00400000, IFindentshift = 8, IFindentmask = (255< element. .B Background is as described in the section .IR "Background Specifications" , and .B backgrounditem is set to be an image item for the document's background image (if given as a URL), or else nil. .B Text gives the default foreground text color of the document, .B link the unvisited hyperlink color, .B vlink the visited hyperlink color, and .B alink the color for highlighting hyperlinks (all in 24-bit RGB format). .B Target is the default target frame id. .B Chset and .B mediatype are as for the .I chset and .I mtype parameters to .IR parsehtml . .B Scripttype is the type of any scripts contained in the document, and is always .BR TextJavascript . .B Hasscripts is set if the document contains any scripts. Scripting is currently unsupported. .B Refresh is the contents of a .B "" tag, if any. .B Kidinfo is set if this document is a frameset (see section .IR Frames ). .B Frameid is this document's frame id.
.PP .B Anchors is a list of hyperlinks contained in the document, and .B dests is a list of hyperlink destinations within the page (see the following section for details). .BR Forms , .B tables and .B maps are lists of the various forms, tables and client-side maps contained in the document, as described in subsequent sections. .B Images is a list of all the image items in the document. .SS Anchors .PP The library builds two lists for all of the .B <a> elements (anchors) in a document. Each anchor is assigned a unique anchor id within the document. For anchors which are hyperlinks (the .B href attribute was supplied), the following structure is defined: .PP .EX .ta 6n +\w'Anchor* 'u typedef struct Anchor Anchor; struct Anchor { Anchor* next; int index; Rune* name; Rune* href; int target; }; .EE .PP .B Next points to the next anchor in the list (the head of this list is .BR Docinfo.anchors ). .B Index is the anchor id; each item within this hyperlink is tagged with this value in its .B anchorid field. .B Name and .B href are the values of the correspondingly named attributes of the anchor (in particular, href is the URL to go to). .B Target is the value of the target attribute (if provided) converted to a frame id. .PP Destinations within the document (anchors with the name attribute set) are held in the .B Docinfo.dests list, using the following structure: .PP .EX .ta 6n +\w'DestAnchor* 'u typedef struct DestAnchor DestAnchor; struct DestAnchor { DestAnchor* next; int index; Rune* name; Item* item; }; .EE .PP .B Next is the next element of the list, .B index is the anchor id, .B name is the value of the name attribute, and .B item points to the item within the parsed document that should be considered to be the destination. .SS Forms .PP Any forms within a document are kept in a list, headed by .BR Docinfo.forms .
The elements of this list are as follows: .PP .EX .ta 6n +\w'Formfield* 'u typedef struct Form Form; struct Form { Form* next; int formid; Rune* name; Rune* action; int target; int method; int nfields; Formfield* fields; }; .EE .PP .B Next points to the next form in the list. .B Formid is a serial number for the form within the document. .B Name is the value of the form's name or id attribute. .B Action is the value of any action attribute. .B Target is the value of the target attribute (if any) converted to a frame target id. .B Method is one of .B HGet or .BR HPost . .B Nfields is the number of fields in the form, and .B fields is a linked list of the actual fields. .PP The individual fields in a form are described by the following structure: .PP .EX .ta 6n +\w'Formfield* 'u typedef struct Formfield Formfield; struct Formfield { Formfield* next; int ftype; int fieldid; Form* form; Rune* name; Rune* value; int size; int maxlength; int rows; int cols; uchar flags; Option* options; Item* image; int ctlid; SEvent* events; }; .EE .PP Here, .B next points to the next field in the list. .B Ftype is the type of the field, which can be one of .BR Ftext , .BR Fpassword , .BR Fcheckbox , .BR Fradio , .BR Fsubmit , .BR Fhidden , .BR Fimage , .BR Freset , .BR Ffile , .BR Fbutton , .B Fselect or .BR Ftextarea . .B Fieldid is a serial number for the field within the form. .B Form points back to the form containing this field. .BR Name , .BR value , .BR size , .BR maxlength , .B rows and .B cols each contain the values of corresponding attributes of the field, if present. .B Flags contains per-field flags, of which .B FFchecked and .B FFmultiple are defined. .B Image is only used for fields of type .BR Fimage ; it points to an image item containing the image to be displayed. .B Ctlid is reserved for use by the caller, typically to store a unique id of an associated control used to implement the field. 
.B Events is the same as the corresponding field of the generic attributes associated with the item containing this field. .B Options is only used by fields of type .BR Fselect ; it consists of a list of possible options that may be selected for that field, using the following structure: .PP .EX .ta 6n +\w'Option* 'u typedef struct Option Option; struct Option { Option* next; int selected; Rune* value; Rune* display; }; .EE .PP .B Next points to the next element of the list. .B Selected is set if this option is to be displayed initially. .B Value is the value to send when the form is submitted if this option is selected. .B Display is the string to display on the screen for this option. .SS Tables .PP The library builds a list of all the tables in the document, headed by .BR Docinfo.tables . Each element of this list has the following format: .PP .EX .ta 6n +\w'Tablecell*** 'u typedef struct Table Table; struct Table { Table* next; int tableid; Tablerow* rows; int nrow; Tablecol* cols; int ncol; Tablecell* cells; int ncell; Tablecell*** grid; Align align; Dimen width; int border; int cellspacing; int cellpadding; Background background; Item* caption; uchar caption_place; Lay* caption_lay; int totw; int toth; int caph; int availw; Token* tabletok; uchar flags; }; .EE .PP .B Next points to the next element in the list of tables. .B Tableid is a serial number for the table within the document. .B Rows is an array of row specifications (described below) and .B nrow is the number of elements in this array. Similarly, .B cols is an array of column specifications, and .B ncol the size of this array. .B Cells is a list of all cells within the table (structure described below) and .B ncell is the number of elements in this list. Note that a cell may span multiple rows and/or columns, thus .B ncell may be smaller than .BR nrow*ncol . .B Grid is a two-dimensional array of cells within the table; the cell at row .B i and column .B j is .BR Table.grid[i][j] . 
A cell that spans multiple rows and/or columns will be referenced by .B grid multiple times, however it will only occur once in .BR cells . .B Align gives the alignment specification for the entire table, and .B width gives the requested width as a dimension specification. .BR Border , .B cellspacing and .B cellpadding give the values of the corresponding attributes for the table, and .B background gives the requested background for the table. .B Caption is a linked list of items to be displayed as the caption of the table, either above or below depending on whether .B caption_place is .B ALtop or .BR ALbottom . Most of the remaining fields are reserved for use by the caller, except .BR tabletok , which is reserved for internal use. The type .B Lay is not defined by the library; the caller can provide its own definition. .PP The .B Tablecol structure is defined for use by the caller. The library ensures that the correct number of these is allocated, but leaves them blank. The fields are as follows: .PP .EX .ta 6n +\w'Point 'u typedef struct Tablecol Tablecol; struct Tablecol { int width; Align align; Point pos; }; .EE .PP The rows in the table are specified as follows: .PP .EX .ta 6n +\w'Background 'u typedef struct Tablerow Tablerow; struct Tablerow { Tablerow* next; Tablecell* cells; int height; int ascent; Align align; Background background; Point pos; uchar flags; }; .EE .PP .B Next is only used during parsing; it should be ignored by the caller. .B Cells provides a list of all the cells in a row, linked through their .B nextinrow fields (see below). .BR Height , .B ascent and .B pos are reserved for use by the caller. .B Align is the alignment specification for the row, and .B background is the background to use, if specified. .B Flags is used by the parser; ignore this field. 
.PP The individual cells of the table are described as follows: .PP .EX .ta 6n +\w'Background 'u typedef struct Tablecell Tablecell; struct Tablecell { Tablecell* next; Tablecell* nextinrow; int cellid; Item* content; Lay* lay; int rowspan; int colspan; Align align; uchar flags; Dimen wspec; int hspec; Background background; int minw; int maxw; int ascent; int row; int col; Point pos; }; .EE .PP .B Next is used to link together the list of all cells within a table .RB ( Table.cells ), whereas .B nextinrow is used to link together all the cells within a single row .RB ( Tablerow.cells ). .B Cellid provides a serial number for the cell within the table. .B Content is a linked list of the items to be laid out within the cell. .B Lay is reserved for the user to describe how these items have been laid out. .B Rowspan and .B colspan are the number of rows and columns spanned by this cell, respectively. .B Align is the alignment specification for the cell. .B Flags is some combination of .BR TFparsing , .B TFnowrap and .B TFisth or'd together. Here .B TFparsing is used internally by the parser, and should be ignored. .B TFnowrap means that the contents of the cell should not be wrapped if they don't fit the available width; rather, the table should be expanded if need be (this is set when the nowrap attribute is supplied). .B TFisth means that the cell was created by the .B <th> element (rather than the .B <td> element), indicating that it is a header cell rather than a data cell. .B Wspec provides a suggested width as a dimension specification, and .B hspec provides a suggested height in pixels. .B Background gives a background specification for the individual cell. .BR Minw , .BR maxw , .B ascent and .B pos are reserved for use by the caller during layout. .B Row and .B col give the indices of the row and column of the top left-hand corner of the cell within the table grid.
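.PP
Because a spanning cell appears in
.B grid
once for every position it covers, a caller walking the grid usually wants to act on each cell only at its top left-hand corner. The sketch below shows one way to do that. The structures are cut-down stand-ins holding only the fields used here, the function name
.B countorigins
is hypothetical, and treating a nil grid entry as an empty position is an assumption.

```c
#include <stddef.h>

typedef struct Tablecell Tablecell;
typedef struct Table Table;

/* Reduced stand-ins for the library structures: only the fields used below. */
struct Tablecell {
	int rowspan;
	int colspan;
	int row;	/* row of the cell's top left-hand corner */
	int col;	/* column of the cell's top left-hand corner */
};

struct Table {
	int nrow;
	int ncol;
	int ncell;
	Tablecell ***grid;
};

/* Count each cell once by visiting it only where the grid position
 * matches the cell's own row and col. */
int
countorigins(Table *t)
{
	int i, j, n;
	Tablecell *c;

	n = 0;
	for(i = 0; i < t->nrow; i++)
		for(j = 0; j < t->ncol; j++){
			c = t->grid[i][j];
			if(c != NULL && c->row == i && c->col == j)
				n++;
		}
	return n;
}
```

For a table in which every cell is reachable through the grid, the count should agree with
.BR ncell .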
.SS Client-side Maps .PP The library builds a list of client-side maps, headed by .BR Docinfo.maps , and having the following structure: .PP .EX .ta 6n +\w'Rune* 'u typedef struct Map Map; struct Map { Map* next; Rune* name; Area* areas; }; .EE .PP .B Next points to the next element in the list, .B name is the name of the map (used to bind it to an image), and .B areas is a list of the areas within the image that comprise the map, using the following structure: .PP .EX .ta 6n +\w'Dimen* 'u typedef struct Area Area; struct Area { Area* next; int shape; Rune* href; int target; Dimen* coords; int ncoords; }; .EE .PP .B Next points to the next element in the map's list of areas. .B Shape describes the shape of the area, and is one of .BR SHrect , .B SHcircle or .BR SHpoly . .B Href is the URL associated with this area in its role as a hypertext link, and .B target is the target frame it should be loaded in. .B Coords is an array of coordinates for the shape, and .B ncoords is the size of this array (number of elements). .SS Frames .PP If the .B Docinfo.kidinfo field is set, the document is a frameset. In this case, it is typical for .I parsehtml to return nil, as a document which is a frameset should have no actual items that need to be laid out (such will appear only in subsidiary documents). It is possible that items will be returned by a malformed document; the caller should check for this and free any such items. .PP The .B Kidinfo structure itself reflects the fact that framesets can be nested within a document.
It is defined as follows: .PP .EX .ta 6n +\w'Kidinfo* 'u typedef struct Kidinfo Kidinfo; struct Kidinfo { Kidinfo* next; int isframeset; // fields for "frame" Rune* src; Rune* name; int marginw; int marginh; int framebd; int flags; // fields for "frameset" Dimen* rows; int nrows; Dimen* cols; int ncols; Kidinfo* kidinfos; Kidinfo* nextframeset; }; .EE .PP .B Next is only used if this structure is part of a containing frameset; it points to the next element in the list of children of that frameset. .B Isframeset is set when this structure represents a frameset; if clear, it is an individual frame. .PP Some fields are used only for framesets. .B Rows is an array of dimension specifications for rows in the frameset, and .B nrows is the length of this array. .B Cols is the corresponding array for columns, of length .BR ncols . .B Kidinfos points to a list of components contained within this frameset, each of which may be a frameset or a frame. .B Nextframeset is only used during parsing, and should be ignored. .PP The remaining fields are used if the structure describes a frame, not a frameset. .B Src provides the URL for the document that should be initially loaded into this frame. Note that this may be a relative URL, in which case it should be interpreted using the containing document's URL as the base. .B Name gives the name of the frame, typically supplied via a name attribute in the HTML. If no name was given, the library allocates one. .BR Marginw , .B marginh and .B framebd are the values of the marginwidth, marginheight and frameborder attributes, respectively.
.B Flags can contain some combination of the following: .B FRnoresize (the frame had the noresize attribute set, and the user should not be allowed to resize it), .B FRnoscroll (the frame should not have any scroll bars), .B FRhscroll (the frame should have a horizontal scroll bar), .B FRvscroll (the frame should have a vertical scroll bar), .B FRhscrollauto (the frame should be automatically given a horizontal scroll bar if its contents would not otherwise fit), and .B FRvscrollauto (the frame gets a vertical scrollbar only if required). .SH SOURCE .B /sys/src/libhtml .SH SEE ALSO .IR fmt (1) .PP W3C World Wide Web Consortium, ``HTML 4.01 Specification''. .SH BUGS The entire HTML document must be loaded into memory before any of it can be parsed.