\input texinfo   @c -*-texinfo-*-
@c %**start of header
@setfilename semantic-langdev.info
@set TITLE Language Support Developer's Guide
@set AUTHOR Eric M. Ludlam, David Ponce, and Richard Y. Kim
@settitle @value{TITLE}

@c *************************************************************************
@c @ Header
@c *************************************************************************

@c Merge all indexes into a single index for now.
@c We can always separate them later into two or more as needed.
@syncodeindex vr cp
@syncodeindex fn cp
@syncodeindex ky cp
@syncodeindex pg cp
@syncodeindex tp cp

@c @footnotestyle separate
@c @paragraphindent 2
@c @@smallbook
@c %**end of header

@copying
This manual documents Language Support Development with Semantic.

Copyright @copyright{} 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2007 Eric M. Ludlam
Copyright @copyright{} 2001, 2002, 2003, 2004 David Ponce
Copyright @copyright{} 2002, 2003 Richard Y. Kim

@quotation
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with the Invariant Sections being list their titles, with the Front-Cover Texts being list, and with the Back-Cover Texts being list. A copy of the license is included in the section entitled ``GNU Free Documentation License''.
@end quotation
@end copying

@ifinfo
@dircategory Emacs
@direntry
* Semantic Language Writer's guide: (semantic-langdev).
@end direntry
@end ifinfo

@iftex
@finalout
@end iftex

@c @setchapternewpage odd
@c @setchapternewpage off

@ifinfo
This file documents Language Support Development with Semantic.
@emph{Infrastructure for parser based text analysis in Emacs}

Copyright @copyright{} 1999, 2000, 2001, 2002, 2003, 2004 @value{AUTHOR}
@end ifinfo

@titlepage
@sp 10
@title @value{TITLE}
@author by @value{AUTHOR}
@vskip 0pt plus 1 fill
Copyright @copyright{} 1999, 2000, 2001, 2002, 2003, 2004 @value{AUTHOR}
@page
@vskip 0pt plus 1 fill
@insertcopying
@end titlepage
@page

@c MACRO inclusion
@include semanticheader.texi

@c *************************************************************************
@c @ Document
@c *************************************************************************
@contents

@node top
@top @value{TITLE}

Semantic is bundled with support for several languages such as C, C++, Java, Python, etc. However, one of the primary goals of semantic is to provide a framework in which anyone can easily add support for other languages. To support a new language, one typically has to provide a lexer and a parser, along with appropriate semantic actions that produce the end result of the parser: the semantic tags.

This guide first discusses the semantic tag data structure to familiarize the reader with the goal. Then all the components necessary for supporting a language are discussed: writing the lexer, writing the parser, writing semantic rules, etc. Finally, several parsers bundled with semantic are discussed as case studies.

@menu
* Tag Structure::
* Language Support Overview::
* Writing Lexers::
* Writing Parsers::
* Parsing a language file::
* Debugging::
* Parser Error Handling::
* GNU Free Documentation License::
* Index::
@end menu

@node Tag Structure
@chapter Tag Structure
@cindex Tag Structure

The end result of the parser for a buffer is a list of @i{tags}. Currently each tag is a list with up to five elements:

@example
("NAME" CLASS ATTRIBUTES PROPERTIES OVERLAY)
@end example

@var{CLASS} represents what kind of tag this is. Common @var{CLASS} values include @code{variable}, @code{function}, or @code{type}. @inforef{Tag Basics, , semantic-appdev.info}.

@var{ATTRIBUTES} is a slot filled with language-specific options for the tag. Function arguments, return type, and other flags are all stored in attributes. A language author fills in the @var{ATTRIBUTES} with the tag constructor, which is parser-style dependent.

@var{PROPERTIES} is a slot generated by the semantic parser harness, and need not be provided by a language author. Programmatically access tag properties with @code{semantic--tag-put-property}, @code{semantic--tag-put-property-no-side-effect} and @code{semantic--tag-get-property}.

@var{OVERLAY} represents positional information for this tag. It is automatically generated by the semantic parser harness, and need not be provided by the language author, unless they provide a tag expansion function via @code{semantic-tag-expand-function}.

The @var{OVERLAY} slot is accessed via several functions returning the beginning, end, and buffer of a tag. Use these functions unless the overlay is really needed (see @inforef{Tag Query, , app-dev-guide}). Depending on the overlay in a program can be dangerous because sometimes the overlay is replaced with an integer pair

@example
[ START END ]
@end example

@noindent
when the buffer the tag belongs to is not in memory. This happens when a user has activated the Semantic Database (@inforef{semanticdb, , semantic-appdev}).

To create tags for a functional or object-oriented language, you can use a series of tag creation functions. @inforef{Creating Tags, , semantic-appdev}.
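
The slot layout described in this chapter can be pictured with a hand-built tag. The following is only a sketch: the tag value and the names @code{my-example-tag} and @code{my-tag-bounds} are made up for illustration, and real code should go through the tag accessor functions rather than @code{nth}. It shows one way to cope with a fifth slot that may hold either an overlay or a @code{[START END]} vector.

@lisp
;; A hand-built tag, for illustration only.  Real tags come from the
;; parser or from the tag creation functions.
(defvar my-example-tag
  (list "main" 'function '(:type "int") nil [86 147])
  "Hypothetical tag: (\"NAME\" CLASS ATTRIBUTES PROPERTIES OVERLAY).")

(defun my-tag-bounds (tag)
  "Return (START END) for TAG.
The fifth slot may be an overlay or a vector [START END]."
  (let ((o (nth 4 tag)))
    (cond ((overlayp o) (list (overlay-start o) (overlay-end o)))
          ((vectorp o) (list (aref o 0) (aref o 1))))))

(my-tag-bounds my-example-tag)   ; => (86 147)
@end lisp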

@node Language Support Overview
@chapter Language Support Overview
@cindex Language Support Overview

Starting with version 2.0, @semantic{} provides many ways to add support for a language into the @semantic{} framework.

The primary means to customize how @semantic{} works is to implement language-specific versions of @i{overloadable} functions. Semantic has a specialized, mode-bound way to do this. @xref{Semantic Overload Mechanism}.

The parser has several parts, all of which are also overloadable. The primary entry point into the parser is @code{semantic-fetch-tags}. It calls @code{semantic-parse-region}, which returns a list of semantic tags; that list is then stored in @code{semantic--buffer-cache}.

@code{semantic-parse-region} is the first ``overloadable'' function. Its default behavior is simply to call @code{semantic-lex}, then pass the lexical token list to @code{semantic-repeat-parse-whole-stream}. At each stage, another more focused layer provides a means of overloading.

The parser is not the only layer that provides overloadable methods. The application APIs (@inforef{top, , semantic-appdev}) provide many overloadable functions as well.

@menu
* Semantic Overload Mechanism::
* Semantic Parser Structure::
* Application API Structure::
@end menu

@node Semantic Overload Mechanism
@section Semantic Overload Mechanism

One of @semantic{}'s goals is to provide a framework for supporting a wide range of languages. Writing parsers for some languages is very simple, e.g., any dialect of the Lisp family such as Emacs Lisp and Scheme. Parsers for many languages, such as C, Java, and Python, can be written as context-free grammars. On the other hand, it is impossible to specify context-free grammars for other languages such as Texinfo. Yet @semantic{} already provides parsers for all these languages.

To support such a wide range of languages, a mechanism was needed that maximizes code reuse yet gives each programmer the flexibility to customize the parser engine at many levels of granularity.
@cindex function overloading
@cindex overloading, function
The solution that @semantic{} provides is the @i{function overloading} mechanism, which allows one to intercept and customize the behavior of many of the functions in the parser engine. First the parser engine breaks down the task of parsing a language into several steps. Each step is represented by an Emacs-Lisp function. Some of these are @code{semantic-parse-region}, @code{semantic-lex}, @code{semantic-parse-stream}, @code{semantic-parse-changes}, etc.

Many built-in @semantic{} functions are declared as being @i{over-loadable} functions, i.e., functions that do reasonable things for most languages, but can be customized to suit the particular needs of a given language. All @i{over-loadable} functions can then easily be @i{over-ridden} if necessary. The rest of this section provides details on this @i{overloading mechanism}.

Over-loadable functions are created by defining functions with the @code{define-overload} macro rather than the usual @code{defun}. @code{define-overload} is a thin wrapper around @code{defun} that sets up the function so that it can be overloaded. An @i{over-loadable} function can then be @i{over-ridden} in one of two ways: @code{define-mode-overload-implementation} and @code{semantic-install-function-overrides}.

Let's look at a couple of examples. @code{semantic-parse-region} is one of the top level functions in the parser engine defined via @code{define-overload}:

@example
(define-overload semantic-parse-region
  (start end &optional nonterminal depth returnonerror)
  "Parse the area between START and END, and return any tokens found.

...

tokens.")
@end example

The document string is truncated in the middle above since it is not relevant here. The macro invocation above defines the @code{semantic-parse-region} Emacs-Lisp function, which first checks whether there is an overloaded implementation. If one is found, it is called. If no mode-specific implementation is found, the default implementation is called, which in this case is @code{semantic-parse-region-default}, i.e., a function with the same name but with a trailing @i{-default}. That function needs to be written separately and must take the same arguments as the entry created with @code{define-overload}.
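
The lookup order just described can be modelled in a few lines of plain Emacs Lisp. This is a simplified sketch, not the code that @code{define-overload} generates: @code{my-call-overload} and the @code{my-greet} functions are hypothetical names, and the sketch assumes that a mode-specific implementation is simply a function named after the overloadable function plus the major mode.

@lisp
(defun my-greet-default (who)
  "Default implementation, used when no mode specific one exists."
  (format "Hello, %s" who))

(defun my-greet-text-mode (who)
  "Implementation used only when the major mode is `text-mode'."
  (format "Dear %s" who))

(defun my-call-overload (name mode &rest args)
  "Call the MODE specific version of NAME if defined, else NAME-default."
  (let ((specific (intern-soft (format "%s-%s" name mode)))
        (default (intern (format "%s-default" name))))
    (apply (if (and specific (fboundp specific)) specific default)
           args)))

(my-call-overload 'my-greet 'text-mode "reader")   ; => "Dear reader"
(my-call-overload 'my-greet 'c-mode "reader")      ; => "Hello, reader"
@end lisp

Keeping the default in a separate @i{-default} function means an override can still call it, as some real overrides do.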

One way to overload @code{semantic-parse-region} is via @code{semantic-install-function-overrides}. An example from @file{semantic-texi.el} file is shown below:

@example
(defun semantic-default-texi-setup ()
  "Set up a buffer for parsing of Texinfo files."
  ;; This will use our parser.
  (semantic-install-function-overrides
   '((parse-region . semantic-texi-parse-region)
     (parse-changes . semantic-texi-parse-changes)))
  ...
  )

(add-hook 'texinfo-mode-hook 'semantic-default-texi-setup)
@end example

The function above is called whenever a buffer is set up in Texinfo mode. The call to @code{semantic-install-function-overrides} indicates that @code{semantic-texi-parse-region} is to over-ride the default implementation of @code{semantic-parse-region}. Note the use of the @code{parse-region} symbol, which is @code{semantic-parse-region} without the leading @i{semantic-} prefix.

Another way to over-ride a built-in @semantic{} function is via @code{define-mode-overload-implementation}. An example from @file{wisent-python.el} file is shown below.

@example
(define-mode-overload-implementation
  semantic-parse-region python-mode
  (start end &optional nonterminal depth returnonerror)
  "Over-ride in order to initialize some variables."
  (let ((wisent-python-lexer-indent-stack '(0))
        (wisent-python-explicit-line-continuation nil))
    (semantic-parse-region-default
     start end nonterminal depth returnonerror)))
@end example

The code above over-rides @code{semantic-parse-region} so that, for buffers whose major mode is @code{python-mode}, it is executed instead of the default implementation.

@subsection Why not to use advice

One may wonder why @semantic{} defines an overload mechanism when Emacs already has advice. @xref{(elisp)Advising Functions}.

Advising is generally considered a mechanism of last resort, used when modifying or hooking into an existing package without modifying its source file. Overloadable functions advertise that they @i{should} be overloaded, and define syntactic sugar to do so.

@node Semantic Parser Structure
@section Semantic Parser Structure

NOTE: describe the functions that do parsing, and how to overload each.

@ignore
semantic-fetch-tags is the top level function that parses the current buffer.
semantic-parse-changes
semantic-parse-changes-default
semantic-edits-incremental-parser
semantic-parse-region (overloadable)
semantic-parse-region-default
semantic-lex (overloadable)
*semantic-lex-analyzer
semantic-flex
semantic-repeat-parse-whole-stream
semantic-parse-stream (overloadable)
semantic-parse-stream-default
semantic-bovinate-stream (bovine)
wisent-parse-stream (wisent)
semantic-texi-parse-region
@end ignore

@example

@ignore
semantic-post-change-major-mode-function
semantic-parser-name

semantic-toplevel-bovine-table (see semantic-active-p)
semantic-bovinate-stream
semantic-toplevel-bovine-table
semantic-parse-region
semantic-parse-region-default
semantic-lex
semantic-repeat-parse-whole-stream

semantic-init-db-hooks
semanticdb-semantic-init-hook-fcn
semantic-init-hooks
semantic-auto-parse-mode

semantic-flex-keywords-obarray (see semantic-bnf-keyword-table)
Used by semantic-lex-keyword-symbol, semantic-lex-keyword-set,
semantic-lex-map-keywords, semantic-flex
semantic-lex-types-obarray

* To support a new language, one must write a set of Emacs-Lisp
functions that converts any valid text written in that language
into a list of semantic tokens. Typically this task is divided into two
areas: a lexer and a parser.
* There are many ways of doing this. However in almost all cases, two
* Parser converts
wisent parsers
bovine parsers
custom parsers
@end ignore

@end example

@node Application API Structure
@section Application API Structure

NOTE: improve this:

How to program against the data structures created by @semantic{}, using the application programming API, is described in the Application development guide. Read that guide to get a feel for the specifics of what you can customize. @inforef{top, , semantic-appdev}.

Here is a list of applications, and the specific APIs that you will need to overload to make them work properly with your language.

@table @code
@item imenu
@itemx speedbar
@itemx ecb
These tools require that the @code{semantic-format} methods create correct strings. @inforef{Format Tag, , semantic-appdev}.
@item semantic-analyze
The analysis tool requires that the @code{semanticdb} tool is active, and that the searching methods are overloaded. In addition, a @code{semanticdb} system database could be written to provide symbols from the global environment of your language. @inforef{System Databases, , semantic-appdev}.

In addition, the analyzer requires that the @code{semantic-ctxt} methods are overloaded. These methods allow the analyzer to look at the context of the cursor in your language, and predict the type of location of the cursor. @inforef{Derived Context, , semantic-appdev}.
@item semantic-idle-summary-mode
@itemx semantic-idle-completions-mode
These tools use the semantic analysis tool.
@end table

@menu
* Semantic Analyzer Support::
@end menu

@node Semantic Analyzer Support
@subsection Semantic Analyzer Support

@ignore
>> From an Email I sent to get David started on supporting the analyzer


First, context parsing needs to work. This includes `semantic-ctxt-current-symbol', `-function', `-assignment'. You also need `semantic-get-local-arguments' and -local-variables'.


The next most critical piece is to provide implementations of the semanticdb-find search path calculation API. `semanticdb-find-table-for-include' is a good start. That really should use `semantic-dependency-tag-file', but that doesn't use semanticdb-project-root when looking for files. Java could be trouble here since you can import a *.


A couple more good ones is `semanticdb-find-translate-path-brutish' and `semanticdb-find-translate-path-includes'. Brutish searches look at everything in the current project. The include path will scan only those items explicitly included into your file.


Last but not least, for Java, we need a semanticdb back end that will provide tags out of a jar file. Since most objects inherit from a system library (like Object), you will need this to get the tag list including `clone' and the like.
@end ignore

@node Writing Lexers
@chapter Writing Lexers
@cindex Writing Lexers

@ignore
Are we going to support semantic-flex as well as the new lexer?

Not in the doc - Eric
@end ignore

In order to reduce a source file into a tag table, it must first be converted into a token stream. Tokens are syntactic elements such as whitespace, symbols, strings, lists, and punctuation.

The lexer uses the major-mode's syntax table for conversion. @xref{Syntax Tables,,,elisp}. As long as that is set up correctly (along with the important @code{comment-start} and @code{comment-start-skip} variables) the lexer should already work for your language.

The primary entry point of the lexer is the @dfn{semantic-lex} function shown below. Normally, you do not need to call this function. It is usually called by @emph{semantic-fetch-tags} for you.

@anchor{semantic-lex}
@defun semantic-lex start end &optional depth length
Lexically analyze text in the current buffer between @var{START} and @var{END}. Optional argument @var{DEPTH} indicates at what level to scan over entire lists. The last argument, @var{LENGTH}, specifies that @dfn{semantic-lex} should only return @var{LENGTH} tokens. The return value is a token stream. Each element is a list of the form @code{(symbol start-expression . end-expression)}, where @var{SYMBOL} denotes the token type. See the @code{semantic-lex-tokens} variable for details on token types. @var{END} does not mark the end of the text scanned, only the end of the region in which a token may begin. Thus, if a string extends past @var{END}, the end of the returned token will be larger than @var{END}. To truly restrict scanning, use @dfn{narrow-to-region}.
@end defun

@menu
* Lexer Overview::              What is a Lexer?
* Lexer Output::                Output of a Lexical Analyzer
* Lexer Construction::          Constructing your own lexer
* Lexer Built In Analyzers::    Built in analyzers you can use
* Lexer Analyzer Construction:: Constructing your own analyzers
* Keywords::                    Specialized lexical tokens.
* Keyword Properties::
@end menu

@node Lexer Overview
@section Lexer Overview

The @semantic{} lexer breaks up the content of an Emacs buffer into a stream of tokens. This process is based mostly on regular expressions, which in turn depend on the syntax table of the buffer's major mode being set up properly. @xref{Major Modes,,,emacs}. @xref{Syntax Tables,,,elisp}. @xref{Regexps,,,emacs}.

The top level lexical function, @dfn{semantic-lex}, calls the function stored in @dfn{semantic-lex-analyzer}. The default value is the function @dfn{semantic-flex} from version 1.4 of Semantic. This will eventually be deprecated.

In the default lexer, the following regular expressions which rely on syntax tables are used:

@table @code
@item @code{\s-}
whitespace characters
@item @code{\sw}
word constituent
@item @code{\s_}
symbol constituent
@item @code{\s.}
punctuation character
@item @code{\s<}
comment starter
@item @code{\s>}
comment ender
@item @code{\s\}
escape character
@item @code{\s)}
close parenthesis character
@item @code{\s$}
paired delimiter
@item @code{\s"}
string quote
@item @code{\s'}
expression prefix
@end table

In addition, Emacs' built-in features such as @code{comment-start-skip}, @code{forward-comment}, @code{forward-list}, and @code{forward-sexp} are employed.
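
To see how such syntax-class expressions behave, the following sketch classifies each character of a short string. It does not use any @semantic{} code; @code{my-syntax-class-at-point} is a hypothetical helper, and the result depends entirely on the syntax table of the buffer it runs in (here, the standard syntax table of a temporary buffer), which is exactly why the major mode's syntax table matters to the lexer.

@lisp
(defun my-syntax-class-at-point ()
  "Return a symbol naming the syntax class of the character at point."
  (cond ((looking-at "\\s-") 'whitespace)
        ((looking-at "\\sw\\|\\s_") 'symbol)
        ((looking-at "\\s(") 'open-paren)
        ((looking-at "\\s)") 'close-paren)
        ((looking-at "\\s\"") 'string)
        ((looking-at "\\s.") 'punctuation)
        (t 'other)))

(defun my-syntax-classes (string)
  "Return the list of syntax classes for each character of STRING."
  (with-temp-buffer
    (insert string)
    (goto-char (point-min))
    (let (classes)
      (while (not (eobp))
        (push (my-syntax-class-at-point) classes)
        (forward-char 1))
      (nreverse classes))))

(my-syntax-classes "i_1 (2);")
@end lisp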

@node Lexer Output
@section Lexer Output

The lexer, @ref{semantic-lex}, scans the content of a buffer and returns a token list. Let's illustrate this using this simple example.

@example
00: /*
01:  * Simple program to demonstrate semantic.
02:  */
03:
04: #include <stdio.h>
05:
06: int i_1;
07:
08: int
09: main(int argc, char** argv)
10: @{
11:     printf("Hello world.\n");
12: @}
@end example

Evaluating @code{(semantic-lex (point-min) (point-max))} within the buffer with the code above returns the following token list. The input line and string that produced each token is shown after each semi-colon.

@example
((punctuation     52 .  53)     ; 04: #
 (INCLUDE         53 .  60)     ; 04: include
 (punctuation     61 .  62)     ; 04: <
 (symbol          62 .  67)     ; 04: stdio
 (punctuation     67 .  68)     ; 04: .
 (symbol          68 .  69)     ; 04: h
 (punctuation     69 .  70)     ; 04: >
 (INT             72 .  75)     ; 06: int
 (symbol          76 .  79)     ; 06: i_1
 (punctuation     79 .  80)     ; 06: ;
 (INT             82 .  85)     ; 08: int
 (symbol          86 .  90)     ; 09: main
 (semantic-list   90 . 113)     ; 09: (int argc, char** argv)
 (semantic-list  114 . 147)     ; 10-12: body of main function
 )
@end example

As shown above, the token list is a list of ``tokens''. Each token in turn is a list of the form

@example
(TOKEN-TYPE BEGINNING-POSITION . ENDING-POSITION)
@end example

@noindent
where TOKEN-TYPE is a symbol, and the other two are integers indicating the buffer positions that delimit the token such that

@lisp
(buffer-substring BEGINNING-POSITION ENDING-POSITION)
@end lisp

@noindent
would return the string form of the token.
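
As a small illustration of this token format, the sketch below recovers the text of a few hand-written tokens. It does not call the lexer: the token list is typed in by hand in the format shown above, and @code{my-token-text} is a hypothetical helper.

@lisp
(defun my-token-text (token)
  "Return the buffer text covered by TOKEN, a list (TYPE START . END)."
  (buffer-substring-no-properties (cadr token) (cddr token)))

(with-temp-buffer
  (insert "int i_1;")
  ;; Hand-written tokens, not lexer output.
  (mapcar #'my-token-text
          '((INT 1 . 4) (symbol 5 . 8) (punctuation 8 . 9))))
;; => ("int" "i_1" ";")
@end lisp

Because a token is a dotted list, the start position is the @code{cadr} and the end position is the @code{cddr}, not the @code{caddr}.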

Note that one line (line 4 above) can produce seven tokens while the whole body of the function produces a single token. This is because the @var{depth} parameter of @code{semantic-lex} was not specified. Let's see the output when @var{depth} is set to 1. Evaluate @code{(semantic-lex (point-min) (point-max) 1)} in the same buffer. Note the third argument of @code{1}.

@example
((punctuation     52 .  53)     ; 04: #
 (INCLUDE         53 .  60)     ; 04: include
 (punctuation     61 .  62)     ; 04: <
 (symbol          62 .  67)     ; 04: stdio
 (punctuation     67 .  68)     ; 04: .
 (symbol          68 .  69)     ; 04: h
 (punctuation     69 .  70)     ; 04: >
 (INT             72 .  75)     ; 06: int
 (symbol          76 .  79)     ; 06: i_1
 (punctuation     79 .  80)     ; 06: ;
 (INT             82 .  85)     ; 08: int
 (symbol          86 .  90)     ; 09: main

 (open-paren      90 .  91)     ; 09: (
 (INT             91 .  94)     ; 09: int
 (symbol          95 .  99)     ; 09: argc
 (punctuation     99 . 100)     ; 09: ,
 (CHAR           101 . 105)     ; 09: char
 (punctuation    105 . 106)     ; 09: *
 (punctuation    106 . 107)     ; 09: *
 (symbol         108 . 112)     ; 09: argv
 (close-paren    112 . 113)     ; 09: )

 (open-paren     114 . 115)     ; 10: @{
 (symbol         120 . 126)     ; 11: printf
 (semantic-list  126 . 144)     ; 11: ("Hello world.\n")
 (punctuation    144 . 145)     ; 11: ;
 (close-paren    146 . 147)     ; 12: @}
 )
@end example

The @var{depth} parameter ``peeled away'' one more level of ``list'' delimited by matching parentheses or braces. The depth parameter can be set to any number. However, the parser needs to be able to handle the extra tokens.

This is an interesting benefit of the lexer having the full resources of Emacs at its disposal. Skipping over matched parentheses is achieved by simply calling the built-in functions @code{forward-list} and @code{forward-sexp}.
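
The idea can be sketched without any @semantic{} code. The hypothetical helper below uses @code{forward-list} to find the end of a parenthesized list and returns a token in the @code{semantic-list} format shown earlier; it only illustrates the technique and is not how the lexer itself is written.

@lisp
(defun my-list-token-at-point ()
  "Return (semantic-list START . END) for the list that starts at point.
Point must be on an open parenthesis character."
  (save-excursion
    (let ((start (point)))
      (forward-list 1)
      (cons 'semantic-list (cons start (point))))))

(with-temp-buffer
  (insert "main(int argc, char** argv)")
  (goto-char 5)                      ; on the open parenthesis
  (my-list-token-at-point))
;; => (semantic-list 5 . 28)
@end lisp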

@node Lexer Construction
@section Lexer Construction

While using the default lexer is certainly an option, particularly for grammars written in semantic 1.4 style, it is usually more efficient to create a custom lexer for your language.

You can create a new lexer with @dfn{define-lex}.

@defun define-lex name doc &rest analyzers
@anchor{define-lex}
Create a new lexical analyzer with @var{NAME}. @var{DOC} is a documentation string describing this analyzer. @var{ANALYZERS} are small code snippets of analyzers to use when building the new @var{NAMED} analyzer. Only use analyzers which are written to be used in @dfn{define-lex}. Each analyzer should be an analyzer created with @dfn{define-lex-analyzer}. Note: The order in which analyzers are listed is important. If two analyzers can match the same text, it is important to order the analyzers so that the one you want to match first occurs first. For example, it is good to put a number analyzer in front of a symbol analyzer which might mistake a number for a symbol.
@end defun
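
A lexer for a small C-like language might be declared as in the sketch below. It uses only analyzers documented in this chapter; the name @code{my-simple-lexer} is hypothetical, and the @code{require} line assumes the library name used by the Semantic bundled with Emacs (standalone releases may name the file differently).

@lisp
;; Assumption: the lexer library is available under this name.
(require 'semantic/lex)

(define-lex my-simple-lexer
  "Hypothetical lexical analyzer for a small C-like language."
  semantic-lex-ignore-whitespace
  semantic-lex-ignore-newline
  semantic-lex-number              ; listed before the symbol analyzer
  semantic-lex-symbol-or-keyword
  semantic-lex-paren-or-list
  semantic-lex-close-paren
  semantic-lex-string
  semantic-lex-ignore-comments
  semantic-lex-punctuation
  semantic-lex-default-action)     ; last: signals an error
@end lisp

A language setup function would then typically store the new function in @code{semantic-lex-analyzer}, for example with @code{(setq semantic-lex-analyzer 'my-simple-lexer)}.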

The list of @var{analyzers} needed here can consist of any of several built-in analyzers, or analyzers of your own construction. The built-in analyzers are described in the next section.

@node Lexer Built In Analyzers
@section Lexer Built In Analyzers

@defspec semantic-lex-default-action
The default action when no other lexical actions match text. This action will just throw an error.
@end defspec

@defspec semantic-lex-beginning-of-line
Detect and create a beginning of line token (BOL).
@end defspec

@defspec semantic-lex-newline
Detect and create newline tokens.
@end defspec

@defspec semantic-lex-newline-as-whitespace
Detect newlines and create whitespace tokens for them. Use this ONLY if newlines are not whitespace characters (such as when they are comment end characters) AND when you want whitespace tokens.
@end defspec

@defspec semantic-lex-ignore-newline
Detect and ignore newline tokens. Use this ONLY if newlines are not whitespace characters (such as when they are comment end characters).
@end defspec

@defspec semantic-lex-whitespace
Detect and create whitespace tokens.
@end defspec

@defspec semantic-lex-ignore-whitespace
Detect and skip over whitespace tokens.
@end defspec

@defspec semantic-lex-number
Detect and create number tokens. Number tokens are matched via this variable:

@defvar semantic-lex-number-expression
Regular expression for matching a number. If this value is @code{nil}, no number extraction is done during lex. This expression tries to match C and Java like numbers.

@example
DECIMAL_LITERAL:
    [1-9][0-9]*
  ;
HEX_LITERAL:
    0[xX][0-9a-fA-F]+
  ;
OCTAL_LITERAL:
    0[0-7]*
  ;
INTEGER_LITERAL:
    <DECIMAL_LITERAL>[lL]?
  | <HEX_LITERAL>[lL]?
  | <OCTAL_LITERAL>[lL]?
  ;
EXPONENT:
    [eE][+-]?[0-9]+
  ;
FLOATING_POINT_LITERAL:
    [0-9]+[.][0-9]*<EXPONENT>?[fFdD]?
  | [.][0-9]+<EXPONENT>?[fFdD]?
  | [0-9]+<EXPONENT>[fFdD]?
  | [0-9]+<EXPONENT>?[fFdD]
  ;
@end example
@end defvar

@end defspec
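
To make the grammar above concrete, the following sketch translates only the @code{INTEGER_LITERAL} rule into an Emacs regular expression. It is not the value of @code{semantic-lex-number-expression}; the names @code{my-integer-literal-regexp} and @code{my-integer-literal-p} are hypothetical, and floating point literals are deliberately left out.

@lisp
(defconst my-integer-literal-regexp
  (concat "\\`\\(?:"
          "0[xX][0-9a-fA-F]+"    ; HEX_LITERAL
          "\\|0[0-7]*"           ; OCTAL_LITERAL
          "\\|[1-9][0-9]*"       ; DECIMAL_LITERAL
          "\\)[lL]?\\'")
  "Hypothetical regexp for the INTEGER_LITERAL rule only.")

(defun my-integer-literal-p (string)
  "Return non-nil if STRING is a complete integer literal."
  (let ((case-fold-search nil))
    (and (string-match my-integer-literal-regexp string) t)))

(mapcar #'my-integer-literal-p '("42" "0x1F" "017L" "09" "1.5"))
;; => (t t t nil nil)
@end lisp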

@defspec semantic-lex-symbol-or-keyword
Detect and create symbol and keyword tokens.
@end defspec

@defspec semantic-lex-charquote
Detect and create charquote tokens.
@end defspec

@defspec semantic-lex-punctuation
Detect and create punctuation tokens.
@end defspec

@defspec semantic-lex-punctuation-type
Detect and create a punctuation type token. Recognized punctuations are defined in the current table of lexical types, as the value of the `punctuation' token type.
@end defspec

@defspec semantic-lex-paren-or-list
Detect open parenthesis. Return either a paren token or a semantic list token depending on `semantic-lex-current-depth'.
@end defspec

@defspec semantic-lex-open-paren
Detect and create an open parenthesis token.
@end defspec

@defspec semantic-lex-close-paren
Detect and create a close paren token.
@end defspec

@defspec semantic-lex-string
Detect and create a string token.
@end defspec

@defspec semantic-lex-comments
Detect and create a comment token.
@end defspec

@defspec semantic-lex-comments-as-whitespace
Detect comments and create a whitespace token.
@end defspec

@defspec semantic-lex-ignore-comments
Detect and skip over comments.
@end defspec

@node Lexer Analyzer Construction
@section Lexer Analyzer Construction

Each of the previous built-in analyzers is constructed using a set of analyzer construction macros. The root construction macro is:

@defun define-lex-analyzer name doc condition &rest forms
Create a single lexical analyzer @var{NAME} with @var{DOC}. When an analyzer is called, the current buffer and point are positioned at the location to be analyzed. @var{CONDITION} is an expression which returns @code{t} if @var{FORMS} should be run. Within the bounds of @var{CONDITION} and @var{FORMS}, backquote can be used to evaluate expressions at compile time. While forms are running, the following variables will be locally bound:

@table @code
@item semantic-lex-analysis-bounds
The bounds of the current analysis, of the form (@var{START} . @var{END}).
@item semantic-lex-maximum-depth
The maximum depth of semantic-list for the current analysis.
@item semantic-lex-current-depth
The current depth of @code{semantic-list} that has been descended.
@item semantic-lex-end-point
End point after match. Analyzers should set this to a buffer location if their match string does not represent the end of the matched text.
@item semantic-lex-token-stream
The token list being collected. Add new lexical tokens to this list.
@end table

Proper action in @var{FORMS} is to move the value of @code{semantic-lex-end-point} to after the location of the analyzed entry, and to add any discovered tokens at the beginning of @code{semantic-lex-token-stream}. This can be done by using @dfn{semantic-lex-push-token}.
@end defun

Additionally, a simple regular expression based analyzer can be built with:

@defun define-lex-regex-analyzer name doc regexp &rest forms
Create a lexical analyzer with @var{NAME} and @var{DOC} that will match @var{REGEXP}. @var{FORMS} are evaluated upon a successful match. See @dfn{define-lex-analyzer} for more about analyzers.
@end defun

@defun define-lex-simple-regex-analyzer name doc regexp toksym &optional index &rest forms
Create a lexical analyzer with @var{NAME} and @var{DOC} that will match @var{REGEXP}. @var{TOKSYM} is the symbol to use when creating a semantic lexical token. @var{INDEX} is the index into the match that defines the bounds of the token. Index should be a plain integer, and not specified in the macro as an expression. @var{FORMS} are evaluated upon a successful match @var{BEFORE} the new token is created. It is valid to ignore @var{FORMS}. See @dfn{define-lex-analyzer} for more about analyzers.
@end defun
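
As an illustration of the simple form, the sketch below defines an analyzer for a two-character arrow operator. The analyzer name @code{my-lex-arrow} and the token symbol @code{ARROW} are hypothetical, and the @code{require} line assumes the library name used by the Semantic bundled with Emacs.

@lisp
(require 'semantic/lex)   ; assumed library name

(define-lex-simple-regex-analyzer my-lex-arrow
  "Detect and create an ARROW token for the two characters \"->\"."
  "->" 'ARROW)
@end lisp

In a @code{define-lex} form, such an analyzer would have to be listed before @code{semantic-lex-punctuation}, which could otherwise claim the same characters one at a time.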

Regular expression analyzers are the simplest to create and manage. Often, a majority of your lexer can be built this way. The analyzer for matching punctuation looks like this:

@example
(define-lex-simple-regex-analyzer semantic-lex-punctuation
  "Detect and create punctuation tokens."
  "\\(\\s.\\|\\s$\\|\\s'\\)" 'punctuation)
@end example

More complex analyzers, which match larger units of text to optimize the speed of parsing and analysis, are built by matching blocks.

@defun define-lex-block-analyzer name doc spec1 &rest specs
Create a lexical analyzer @var{NAME} for paired delimiters blocks. It detects a paired delimiters block or the corresponding open or close delimiter depending on the value of the variable @code{semantic-lex-current-depth}. @var{DOC} is the documentation string of the lexical analyzer. @var{SPEC1} and @var{SPECS} specify the token symbols and open, close delimiters used. Each @var{SPEC} has the form:

(@var{BLOCK-SYM} (@var{OPEN-DELIM} @var{OPEN-SYM}) (@var{CLOSE-DELIM} @var{CLOSE-SYM}))

where @var{BLOCK-SYM} is the symbol returned in a block token. @var{OPEN-DELIM} and @var{CLOSE-DELIM} are respectively the open and close delimiters identifying a block. @var{OPEN-SYM} and @var{CLOSE-SYM} are respectively the symbols returned in open and close tokens.
@end defun
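
A block analyzer following the @var{SPEC} form above might look like the sketch below. The analyzer name and all token symbols are hypothetical, and the @code{require} line assumes the library name used by the Semantic bundled with Emacs.

@lisp
(require 'semantic/lex)   ; assumed library name

(define-lex-block-analyzer my-lex-blocks
  "Detect parenthesis and bracket blocks, or their delimiters."
  (PAREN_BLOCK ("(" LPAREN) (")" RPAREN))
  (BRACK_BLOCK ("[" LBRACK) ("]" RBRACK)))
@end lisp

With this analyzer in a lexer, a whole parenthesized region would be reported as one @code{PAREN_BLOCK} token until the lexer is asked to descend into it, at which point @code{LPAREN} and @code{RPAREN} tokens would be produced instead.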

These blocks are what make semantic's Emacs Lisp based parsers fast. For example, by defining all text inside @{ braces @} as a block, the parser does not need to know the contents of those braces while parsing, and can skip them altogether.

@node Keywords
@section Keywords

Another important piece of the lexer is the keyword table (see @ref{Writing Parsers}). Your language will want to set up a keyword table for fast conversion of symbol strings to language terminals.

The keywords table can also be used to store additional information about those keywords. The following programming functions can be useful when examining text in a language buffer.

@defun semantic-lex-keyword-p name
Return non-@code{nil} if a keyword with @var{NAME} exists in the keyword table. Return @code{nil} otherwise.
@end defun

@defun semantic-lex-keyword-put name property value
For keyword with @var{NAME}, set its @var{PROPERTY} to @var{VALUE}.
@end defun

@defun semantic-lex-keyword-get name property
For keyword with @var{NAME}, return its @var{PROPERTY} value.
@end defun

@defun semantic-lex-map-keywords fun &optional property
Call function @var{FUN} on every semantic keyword. If optional @var{PROPERTY} is non-@code{nil}, call @var{FUN} only on every keyword which has a @var{PROPERTY} value. @var{FUN} receives a semantic keyword as argument.
@end defun

@defun semantic-lex-keywords &optional property
Return a list of semantic keywords. If optional @var{PROPERTY} is non-@code{nil}, return only keywords which have a @var{PROPERTY} set.
@end defun
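
The sketch below exercises these functions on a tiny, hand-built keyword table. Two things in it are assumptions not described in this section: that the table lives in the buffer-local variable @code{semantic-flex-keywords-obarray}, and that it can be built with @code{semantic-lex-make-keyword-table} from a list of @code{(NAME . TOKEN)} pairs. The helper name @code{my-keyword-demo} is hypothetical. In practice the table is normally generated from a grammar file rather than written by hand.

@lisp
(require 'semantic/lex)   ; assumed library name

(defun my-keyword-demo ()
  "Build a two-entry keyword table in a scratch buffer and query it."
  (with-temp-buffer
    ;; Assumed: this buffer-local variable holds the keyword table.
    (setq semantic-flex-keywords-obarray
          (semantic-lex-make-keyword-table
           '(("if" . IF) ("else" . ELSE))))
    (semantic-lex-keyword-put "if" 'summary "if (<expr>) <statement>")
    (list (and (semantic-lex-keyword-p "if") t)
          (semantic-lex-keyword-p "unless")
          (semantic-lex-keyword-get "if" 'summary))))

(my-keyword-demo)
@end lisp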

Keyword properties can be set up in a grammar file for ease of maintenance. While examining the text in a language buffer, this can provide an easy and quick way of storing details about text in the buffer.

@node Keyword Properties @section Standard Keyword Properties

Keywords in a language can have multiple properties. These properties can be used to associate the string that is the keyword with additional information.

Currently available properties are:

@table @b @item summary The summary property is used by semantic-summary-mode as a help string for the keyword specified. @end table

Notes:

Possible future properties. This is just me musing:

@table @b @item face Face used for highlighting this keyword, differentiating it from the keyword face. @item template @itemx skeleton Some sort of tempo/skel template for inserting the programmatic structure associated with this keyword. @item abbrev As with template. @item action @itemx menu Perhaps the keyword is clickable and some action would be useful. @end table

@node Writing Parsers @chapter Writing Parsers @cindex Writing Parsers

@ignore For the parser developer, I can think of two extra sections. One for semanticdb extensions, (If a system database is needed.) A second for the `semantic-ctxt' extensions. Many of the most interesting tools will completely fail to work without local context parsing support.

Perhaps even a section on foreign tokens. For example, putting a Java token into a C++ file could auto-gen a native method, just as putting a token into a Texinfo file converts it into documentation.

In addition, in the "writing grammars" section should have subsections as listed in the examples of the overview section. It might be useful to have a fourth section describing the similarities between the two file types (by and wy) and how to use the grammar mode. (I'm not sure if that should be covered elsewhere.) @end ignore

When converting a source file into a tag table it is important to specify rules to accomplish this. The rules are stored in the buffer local variable @code{semantic--parse-table}.

While it is certainly possible to write this table yourself, it is most likely you will want to use the @ref{Grammar Programming Environment}.

There are three choices for parsing your language.

@table @b @item Bovine Parser The @dfn{bovine} parser is the original @semantic{} parser, and is an implementation of an @acronym{LL} parser. For more information, @inforef{top, the Bovine Parser Manual, bovine}.

@item Wisent Parser The @dfn{wisent} parser is a port of the GNU Compiler Compiler Bison to Emacs Lisp. Wisent includes the iterative error handler of the bovine parser, and has the same error correction as traditional @acronym{LALR} parsers. For more information, @inforef{top, the Wisent Parser Manual, wisent}.

@item External Parser External parsers, such as the texinfo parser, can be implemented using any means. This allows the use of a regular expression parser for non-regular languages, or external programs for speed. @end table

@menu * External Parsers:: Writing an external parser * Grammar Programming Environment:: Using the grammar writing environment * Parser Backend Support:: Lisp needed to support a grammar. @end menu

@node External Parsers @section External Parsers

The texinfo parser in @file{semantic-texi.el} is an example of an external parser. To make your parser work, you need to have a setup function.

Note: Finish this.

@node Grammar Programming Environment @section Grammar Programming Environment

Semantic grammar files in @file{.by} or @file{.wy} format have their own programming mode. This mode provides indentation and coloring services in those languages. In addition, the grammar languages are also supported by @semantic{}, so tagging information is available to tools such as imenu or speedbar.

For more information, @inforef{top, the Grammar Framework Manual, grammar-fw}.

@node Parsing a language file @chapter Parsing a language file

The best way to call the parser from programs is via @code{semantic-fetch-tags}. This, in turn, uses other internal @acronym{API} functions which plug-in parsers can take advantage of.

@defun semantic-fetch-tags @anchor{semantic-fetch-tags} Fetch semantic tags from the current buffer. If the buffer cache is up to date, return that. If the buffer cache is out of date, attempt an incremental reparse. If the buffer has not been parsed before, or if the incremental reparse fails, then parse the entire buffer. If a lexical error was previously discovered and the buffer was marked unparseable, then do nothing, and return the cache. @end defun

Another approach is to let Emacs call the parser on idle time, when needed, then use @code{semantic-fetch-available-tags} to retrieve and process only the available tags, provided that the @code{semantic-after-*-hook} hooks have been set up to synchronize with new tags when they become available.

@defun semantic-fetch-available-tags @anchor{semantic-fetch-available-tags} Fetch available semantic tags from the current buffer. That is, return tags currently in the cache without parsing the current buffer.

Parse operations happen asynchronously when needed on Emacs idle time. Use the @code{semantic-after-toplevel-cache-change-hook} and @code{semantic-after-partial-cache-change-hook} hooks to synchronize with new tags when they become available. @end defun

@deffn Command semantic-clear-toplevel-cache @anchor{semantic-clear-toplevel-cache} Clear the toplevel tag cache for the current buffer. Clearing the cache will force a complete reparse next time a token stream is requested. @end deffn
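
The two approaches described above can be sketched as follows. This is only an illustration: the @code{my-} function names are hypothetical, and the hook function assumes that @code{semantic-after-toplevel-cache-change-hook} passes the new list of tags as its single argument.

@example
;; A sketch only.  Assumes Semantic is already set up in the buffer.
(require 'semantic)

(defun my-top-level-tag-names ()
  "Return the names of the top-level tags in the current buffer.
This may trigger a reparse if the cache is out of date."
  (mapcar #'semantic-tag-name (semantic-fetch-tags)))

(defun my-report-new-tags (tags)
  "Report how many top-level TAGS the current buffer has."
  (message "%s: %d top-level tags" (buffer-name) (length tags)))

(defun my-watch-tags ()
  "Report tag counts for the current buffer without forcing a parse."
  ;; Use whatever is already in the cache ...
  (my-report-new-tags (semantic-fetch-available-tags))
  ;; ... and report again whenever the toplevel cache changes.
  (add-hook 'semantic-after-toplevel-cache-change-hook
            #'my-report-new-tags nil t))
@end example

The first function is the simpler choice for commands the user runs explicitly. The second suits tools that should never block on a parse.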

@menu * Parser Backend Support:: Parser backend support. @end menu

@node Parser Backend Support @section Parser Backend Support

Once you have written a grammar file that has been compiled into Emacs Lisp code, additional glue needs to be written to finish connecting the generated parser into the Emacs framework.

Large portions of this glue are automatically generated, but will probably need additional modification to get things to work properly.

Typically, a grammar file @file{foo.wy} will create the file @file{foo-wy.el}. It is then useful to also create a file @file{wisent-foo.el} (or @file{semantic-foo.el}) to contain the parser back end, or the glue that completes the semantic support for the language.

@menu * Example Backend File:: * Tag Expansion:: @end menu

@node Example Backend File @subsection Example Backend File

Typical structure for this file is:

@example
;;; semantic-foo.el -- parser support for FOO.

;;; Your copyright Notice

(require 'foo-wy)   ;; The parser
(require 'foo)      ;; major mode definition for FOO

;;; Code:

;;; Lexical Analyzer
;;
;; OPTIONAL
;; It is possible to define your lexical analyzer completely in your
;; grammar file.

(define-lex foo-lexical-analyzer
  "Create a lexical analyzer."
  ...)

;;; Expand Function
;;
;; OPTIONAL
;; Not all languages are so complex as to need this function.
;; See `semantic-tag-expand-function' for more details.
(defun foo-tag-expand-function (tag)
  "Expand TAG into multiple tags if needed."
  ...)

;;; Parser Support
;;
;; OPTIONAL
;; If you need some specialty routines inside your grammar file
;; you can add some here.  The process may be to take diverse info
;; and reorganize it.
;;
;; It is also appropriate to write these functions in the prologue
;; of the grammar function.
(defun foo-do-something-hard (...)
  "...")

;;; Overload methods
;;
;; OPTIONAL
;; To allow your language to be fully supported by all the
;; applications that use semantic, it is important, but not necessary
;; to create implementations of overload methods.
(define-mode-overload-implementation some-semantic-function foo-mode (tag)
  "Implement some-semantic-function for FOO."
  )

;;;###autoload
(defun semantic-default-foo-setup ()
  "Set up a buffer for semantic parsing of the FOO language."
  (semantic-foo-by--install-parser)
  ;; The variable holds a function symbol, so it must be quoted.
  (setq semantic-tag-expand-function 'foo-tag-expand-function
        ;; Many other language specific settings can be done here
        ;; as well.
        )
  ;; This may be optional
  (setq semantic-lex-analyzer #'foo-lexical-analyzer)
  )

;;;###autoload
(add-hook 'foo-mode-hook 'semantic-default-foo-setup)

(provide 'semantic-foo)

;;; semantic-foo.el ends here
@end example

@node Tag Expansion @subsection Tag Expansion

In any language with compound tag types, you will need to implement an @emph{expand function}. Once written, assign it to this variable.

@defvar semantic-tag-expand-function @anchor{semantic-tag-expand-function} Function used to expand a tag. It is passed each tag production, and must return a list of tags derived from it, or @code{nil} if it does not need to be expanded.

Languages with compound definitions should use this function to expand from one compound symbol into several. For example, in C or Java the following definition is easily parsed into one tag:

@example
int a, b;
@end example

This function should take this compound tag and turn it into two tags, one for @code{a}, and the other for @code{b}. @end defvar
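
A minimal sketch of such an expand function is shown here. It assumes, purely for illustration, that the grammar builds the compound variable tag with a list of name strings in its name slot; your grammar may encode compound declarations differently. The name @code{my-foo-expand-variables} is hypothetical.

@example
;; A sketch only.  Assumes a compound tag such as "int a, b;" is
;; produced as one variable tag whose name is the list ("a" "b").
(require 'semantic/tag)

(defun my-foo-expand-variables (tag)
  "Expand a compound variable TAG into one tag per declared name.
Return nil if TAG does not need to be expanded."
  (when (and (semantic-tag-of-class-p tag 'variable)
             (consp (semantic-tag-name tag)))
    ;; Clone the original tag once per name, so every new tag keeps
    ;; the attributes of the compound declaration.
    (mapcar (lambda (name) (semantic-tag-clone tag name))
            (semantic-tag-name tag))))
@end example

A function like this would then be assigned to @code{semantic-tag-expand-function} in the language setup function.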

Additionally, you can use the expand function in conjunction with your language for other types of compound statements. For example, in the Common Lisp Object System, you can have a definition:

@example (defclass classname nil
(slots ...) ...) @end example

This will create both the datatype @code{classname} and the functional constructor @code{classname}. Each slot may have a @code{:accessor} method as well.

You can create a special compounded tag in your rule, for example:

@example classdef: LPAREN DEFCLASS name semantic-list semantic-list RPAREN
(TAG "custom" 'compound-class
:value (list
(TYPE-TAG $3 "class" ...)
(FUNCTION-TAG $3 ...)
))
; @end example

and in your expand function, you would write:

@example (defun my-tag-expand (tag)
"Expand tags for my language."
(when (semantic-tag-of-class-p tag 'compound-class)
(remq nil
(semantic-tag-get-attribute tag :value)))) @end example

This will cause the custom tag to be replaced by the tags created in the @code{:value} attribute of the specially constructed tag.

@node Debugging @chapter Debugging

Grammars can be tricky things to debug. There are several types of tools for debugging in Semantic, and the kind of problem you have determines which tool to use.

@menu * Lexical Debugging:: * Parser Output tools:: * Bovine Parser Debugging:: * Wisent Parser Debugging:: * Overlay Debugging:: * Incremental Parser Debugging:: * Debugging Analysis:: * Semantic 1.4 Doc:: @end menu

@node Lexical Debugging @section Lexical Debugging

The first major problem you may encounter is with lexical analysis. If the text is not transformed into the expected token stream, no parser will understand it.

You can step through the lexical analyzer with the following command:

@deffn Command semantic-lex-debug arg @anchor{semantic-lex-debug} Debug the semantic lexer in the current buffer. Argument @var{ARG} specifies whether to analyze the whole buffer, or start at point. While engaged, each token identified by the lexer will be highlighted in the target buffer. A description of the current token will be displayed in the minibuffer. Press @kbd{SPC} to move to the next lexical token. @end deffn

For an example of what the @code{semantic-lex} function should return, see @ref{Lexer Output}.
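
When stepping interactively is too slow, it can also help to dump the whole token stream at once. The following is a rough sketch of such a command; the name @code{my-dump-lexer-tokens} and the output buffer name are hypothetical, and the @code{require} line assumes the Semantic bundled with Emacs.

@example
;; A sketch only.  Run in a buffer already set up for Semantic.
(require 'semantic/lex)

(defun my-dump-lexer-tokens ()
  "Show the class and text of every lexical token in the buffer."
  (interactive)
  (let ((tokens (semantic-lex (point-min) (point-max))))
    (with-output-to-temp-buffer "*Lexer Tokens*"
      (dolist (token tokens)
        ;; The token text is read from the current (source) buffer.
        (princ (format "%S %S\n"
                       (semantic-lex-token-class token)
                       (semantic-lex-token-text token)))))))
@end example

Comparing this listing against the token types your grammar expects is often the quickest way to spot a missing or misordered analyzer.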

@node Parser Output tools @section Parser Output tools

There are several tools which can be used to see what the parser output is. These will work for any type of parser, including the Bovine and Wisent parsers.

The first and easiest is a minor mode which highlights text the parser did not understand.

@deffn Command semantic-show-unmatched-syntax-mode &optional arg @anchor{semantic-show-unmatched-syntax-mode} Minor mode to highlight unmatched lexical syntax tokens. When a parser executes, some elements in the buffer may not match any parser rules. These text characters are considered unmatched syntax. Often, the display of unmatched syntax can expose coding problems before the compiler is run.

With prefix argument @var{ARG}, turn on if positive, otherwise off. The minor mode can be turned on only if semantic feature is available and the current buffer was set up for parsing. Return non-@code{nil} if the minor mode is enabled.

@table @kbd @item key binding @item C-c , Prefix Command @item C-c , ` semantic-show-unmatched-syntax-next @end table @end deffn

Another interesting mode will display a line between all the tags in the current buffer to make it more obvious where boundaries lie. You can enable this as a minor mode.

@deffn Command semantic-show-tag-boundaries-mode &optional arg @anchor{semantic-show-tag-boundaries-mode} Minor mode to display a boundary in front of tags. The boundary is displayed using an overline in Emacs 21. With prefix argument @var{ARG}, turn on if positive, otherwise off. The minor mode can be turned on only if semantic feature is available and the current buffer was set up for parsing. Return non-@code{nil} if the minor mode is enabled. @end deffn

Another interesting mode helps if you are worried about specific attributes. You can use this minor mode to highlight different tokens in different ways based on the attributes you are most concerned with.

@deffn Command semantic-highlight-by-attribute-mode &optional arg @anchor{semantic-highlight-by-attribute-mode} Minor mode to highlight tags based on some attribute. By default, the protection of a tag will give it a different background color.

With prefix argument @var{ARG}, turn on if positive, otherwise off. The minor mode can be turned on only if semantic feature is available and the current buffer was set up for parsing. Return non-@code{nil} if the minor mode is enabled. @end deffn

Another tool that can be used is a dump of the current list of tags. This shows the actual Lisp representation of the tags generated in a rather bland dump. This can be useful if text was successfully parsed, and you want to be sure that the correct information was captured.

@deffn Command bovinate &optional clear @anchor{bovinate} Bovinate the current buffer. Show output in a temp buffer. Optional argument @var{CLEAR} will clear the cache before bovinating. If @var{CLEAR} is negative, it will do a full reparse, and also not display the output buffer. @end deffn

@node Bovine Parser Debugging @section Bovine Parser Debugging

The bovine parser is described in @inforef{top, ,bovine}.

Aside from using a traditional Emacs Lisp debugger on functions you provide for token expansion, there is one other means of debugging which interactively steps over the rules in your grammar file.

@deffn Command semantic-debug @anchor{semantic-debug} Parse the current buffer and run in debug mode. @end deffn

Once the parser is activated in this mode, the current tag cache is flushed, and the parser is started. At each stage in the @acronym{LL} parser, the current rule and match step are highlighted in your parser source buffer. In a second window, the text being parsed is shown, and the lexical token found is highlighted. A clue of the current stack of saved data is displayed in the minibuffer.

There is a wide range of keybindings that can be used to execute code in your buffer. (Not all are implemented.)

@table @kbd @item n @itemx SPC Next. @item s Step. @item u Up. (Not implemented yet.) @item d Down. (Not implemented yet.) @item f Fail Match. Pretend the current match element and the token in the buffer is a failed match, even if it is not. @item h Print information about the current parser state. @item s Jump to the source buffer. @item p Jump to the parser buffer. @item q Quit. Exits this debug session and the parser. @item a Abort. Aborts one level of the parser, possibly exiting the debugger. @item g Go. Stop debugging, and just start parsing. @item b Set Breakpoint. (Not implemented yet.) @item e @code{eval-expression}. Lets you execute some random Emacs Lisp command. @end table

@b{Note:} While the core of @code{semantic-debug} is a generic debugger interface for rule based grammars, only the bovine parser has a specific backend implementation. If someone wants to implement a debugger backend for wisent, that would be spiff.

@node Wisent Parser Debugging @section Wisent Parser Debugging

Although wisent does not implement a backend for @code{semantic-debug}, it does have some debugging commands for the rule actions. You can read about them in the wisent manual.

@inforef{Grammar Debugging, , wisent}

@node Overlay Debugging @section Overlay Debugging

Once a buffer has been parsed into a tag table, the next most important step is getting those tags activated for a buffer, and storable in a @code{semanticdb} backend. @inforef{semanticdb, , semantic-appdev}.

These two activities depend on the ability of every tag in the table to be linked to and unlinked from the current buffer with an overlay. @inforef{Tag Overlay, , semantic-appdev} @inforef{Tag Hooks, , semantic-appdev}

In this case, the most important function that must be written is:

@defun semantic-tag-components-with-overlays tag @anchor{semantic-tag-components-with-overlays} Return the list of top level components belonging to @var{TAG}. Children are any sub-tags which contain overlays.

Default behavior is to get @code{semantic-tag-components} in addition to the components of anonymous types (if applicable).

Note for language authors:
If a mode defines a language tag that has tags in it with overlays, you should still return them with this function. Ignoring this step will prevent several features from working correctly. This function can be overridden in semantic using the symbol @code{tag-components-with-overlays}. @end defun

If you are successfully building a tag table, and errors occur saving or restoring tags from semanticdb, this is the most likely cause of the problem.

@node Incremental Parser Debugging @section Incremental Parser Debugging

The incremental parser is a highly complex engine for quickly refreshing the tag table of a buffer after some set of changes have been made to that buffer by a user.

There is no debugger or interface to the incremental parser, however there are a few minor modes which can help you identify issues if you think there are problems while incrementally parsing a buffer.

The first stage of the incremental parser is in tracking the changes the user makes to a buffer. You can visibly track these changes too.

@deffn Command semantic-highlight-edits-mode &optional arg @anchor{semantic-highlight-edits-mode} Minor mode for highlighting changes made in a buffer. Changes are tracked by semantic so that the incremental parser can work properly. This mode will highlight those changes as they are made, and clear them when the incremental parser accounts for those edits. With prefix argument @var{ARG}, turn on if positive, otherwise off. The minor mode can be turned on only if semantic feature is available and the current buffer was set up for parsing. Return non-@code{nil} if the minor mode is enabled. @end deffn

Another important aspect of the incremental parser involves tracking the current parser state of the buffer. You can track this state also.

@deffn Command semantic-show-parser-state-mode &optional arg @anchor{semantic-show-parser-state-mode} Minor mode for displaying parser cache state in the modeline. The cache can be in one of three states. They are Up to date, Partial reparse needed, and Full reparse needed. The state is indicated in the modeline with the following characters: @table @kbd @item - The cache is up to date. @item ! The cache requires a full update. @item ^ The cache needs to be incrementally parsed. @item % The cache is not currently parseable. @item @@ Auto-parse in progress (not set here.) @end table

With prefix argument @var{ARG}, turn on if positive, otherwise off. The minor mode can be turned on only if semantic feature is available and the current buffer was set up for parsing. Return non-@code{nil} if the minor mode is enabled. @end deffn

When the incremental parser starts updating the tags buffer, you can also enable a set of messages to help identify how the incremental parser is merging changes with the main buffer.

@defvar semantic-edits-verbose-flag @anchor{semantic-edits-verbose-flag} Non-@code{nil} means the incremental parser is verbose. If @code{nil}, errors are still displayed, but informative messages are not. @end defvar
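
The aids described in this section can be switched on together while you investigate an incremental parsing problem. The following is a small sketch; the function name @code{my-foo-debug-incremental-parser} is hypothetical, and the feature names assume the Semantic bundled with Emacs.

@example
;; A sketch only.  Run in a buffer already set up for Semantic.
(require 'semantic/util-modes)
(require 'semantic/edit)

(defun my-foo-debug-incremental-parser ()
  "Enable the incremental parser debugging aids in this buffer."
  (interactive)
  ;; Highlight edits until the incremental parser accounts for them.
  (semantic-highlight-edits-mode 1)
  ;; Show the parser cache state in the modeline.
  (semantic-show-parser-state-mode 1)
  ;; Ask the incremental parser for informative messages.
  ;; Note that this sets the flag for all buffers.
  (setq semantic-edits-verbose-flag t))
@end example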

@node Debugging Analysis @section Debugging Analysis

The semantic analyzer is at the top of the food chain when it comes to @semantic{} service functionality. The semantic support for a language must be absolute before analysis will work properly.

A good way to test analysis is by placing the cursor in different places, and requesting a dump of the context.

@deffn Command semantic-analyze-current-context position @anchor{semantic-analyze-current-context} Analyze the current context at @var{POSITION}. If called interactively, display interesting information about @var{POSITION} in a separate buffer. Returns an object based on symbol @dfn{semantic-analyze-context}. @end deffn

@ref{Semantic Analyzer Support} @inforef{Analyzer, , semantic-user}

@node Semantic 1.4 Doc @section Semantic 1.4 Doc

@i{ In semantic 1.4 the following documentation was written for debugging. I'm leaving in here until better doc for 2.0 is done. }

Writing language files using BY is significantly easier than writing them using regular expressions in a functional manner. Debugging them, however, can still prove challenging.

There are two ways to debug a language definition if it is not behaving as expected. One way is to debug against the source @file{.by} file.

If your language definition was written in BNF notation, debugging is quite easy. The command @code{semantic-debug} will start you off.

@deffn Command semantic-debug Reparse the current buffer and run in parser debug mode. @end deffn

While debugging, two windows are visible. One window shows the file being parsed, and the syntactic token being tested is highlighted. The second window shows the table being used (in the BY source) with the current rule highlighted. The cursor will sit on the specific match rule being tested against.

In the minibuffer, a brief summary of the current situation is listed. The first element is the syntactic token which is a list of the form:

@example (TYPE START . END) @end example

The rest of the display is a list of all strings collected for the currently tested rule. Each time a new rule is entered, the list is restarted. Upon returning from a rule into a previous match list, the previous match list is restored, with the production of the dependent rule in the list.

Use @kbd{C-g} to stop debugging. There are no commands for any fancier types of debugging.

NOTE: Semantic 2.0 has more debugging commands. Use: @kbd{C-h m semantic-debug-mode} to view.

@node Parser Error Handling @chapter Parser Error Handling @cindex Parser Error Handling

NOTE: Write Me

@node GNU Free Documentation License @appendix GNU Free Documentation License

@include fdl.texi

@node Index @unnumbered Index @printindex cp

@iftex @contents @summarycontents @end iftex

@bye

@c Following comments are for the benefit of ispell.