Run the command "enju" or "mogura". "enju" is a slower but more accurate parser, while "mogura" is slightly less accurate but much faster.
The parser loads its data files and then waits for your input.
% enju
Enju 2.3
Copyright (c) 2005-2008, Tsujii Laboratory, The University of Tokyo.
All rights reserved.
Loading grammar module "enju/grammar"... done.
Loading FOM module "enju/synmodel"... done.
Loading parser module "mayz/up"... done.
Loading module "enju/outputdep"... done.
Initializing parser...
Loading grammar database: /usr/local/lib/enju/DATA/Enju.lexicon /usr/local/lib/enju/DATA/Enju.templates
Initializing external tagger: /usr/local/bin/stepp -t -m /usr/local/share/stepp/models_wsj02-21
Initializing morphological analyzer: /usr/local/bin/enju-morph /usr/local/lib/enju/DATA
Loading Unigram FOM model: /usr/local/lib/enju/DATA/Enju-lex.output.gz
Loading Syntax FOM model: /usr/local/lib/enju/DATA/Enju-syn.output.gz
Loading lexicon table file '/usr/local/lib/enju/DATA/Enju.lexicon.tbl'... done
done.
Ready
Enter one sentence per line, and the parse result is written to standard output. The following example is the output for the sentence "Enju is an efficient HPSG parser."
ROOT	ROOT	ROOT	ROOT	-1	ROOT	ROOT	is	be	VBZ	VB	1
is	be	VBZ	VB	1	verb_arg12	ARG1	Enju	enju	NNP	NNP	0
is	be	VBZ	VB	1	verb_arg12	ARG2	parser	parser	NN	NN	5
an	an	DT	DT	2	det_arg1	ARG1	parser	parser	NN	NN	5
efficient	efficient	JJ	JJ	3	adj_arg1	ARG1	parser	parser	NN	NN	5
HPSG	hpsg	NNP	NNP	4	noun_arg1	ARG1	parser	parser	NN	NN	5
Enju supports a predicate-argument relation format, an XML format, and a stand-off format. It can also run as a CGI server that returns XML-format output.
By default, the output is a set of predicate-argument relations between words. For example, the transitive verb "love" takes two nominal arguments. When Enju parses the sentence "I love you.", it outputs a predicate-subject relation between "love" and "I", and a predicate-object relation between "love" and "you". Each line represents one predicate-argument relation, and an empty line indicates the end of the sentence. Columns of a line are separated by tabs and describe, in order, the predicate word, the label of the relation, and the argument word.
The first line represents the root predicate of the sentence. In this line, the predicate is represented as "ROOT" and the label of the relation is also represented as "ROOT". If the argument of a predicate is missing (for example, the logical subject of a passive expression without a "by" phrase), it is shown as "UNKNOWN".
The position of a word is represented with an integer starting from zero. In the example, the position of "Enju" is 0, "is" is 1, ... and "parser" is 5. Words whose POS is "." (e.g. "." and "?") are ignored. The position numbers of "ROOT" and "UNKNOWN" are output as "-1".
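The column layout described above can be read with a few lines of Python. This is only a sketch: the field names (`surface`, `base`, `pos`, `base_pos`, `position`, `pred_type`) are labels inferred from the example output, not names defined by Enju, and the code assumes every relation line has exactly 12 tab-separated columns as in the example.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    surface: str    # word as it appears in the sentence
    base: str       # base form
    pos: str        # part of speech
    base_pos: str   # part of speech of the base form (inferred label)
    position: int   # zero-based word position, -1 for ROOT/UNKNOWN


class Relation(NamedTuple):
    predicate: Word
    pred_type: str  # e.g. "verb_arg12" (inferred label)
    label: str      # "ROOT", "MOD", "ARG1", ...
    argument: Word


def parse_relation(line: str) -> Relation:
    """Parse one tab-separated predicate-argument line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 12:
        raise ValueError(f"expected 12 columns, got {len(cols)}")
    predicate = Word(cols[0], cols[1], cols[2], cols[3], int(cols[4]))
    argument = Word(cols[7], cols[8], cols[9], cols[10], int(cols[11]))
    return Relation(predicate, cols[5], cols[6], argument)


def parse_sentence(block: str) -> List[Relation]:
    """Parse all relation lines of one sentence (empty lines are skipped)."""
    return [parse_relation(line) for line in block.splitlines() if line.strip()]


# The example output from this manual, rebuilt with tab separators.
SAMPLE = "\n".join("\t".join(row.split()) for row in [
    "ROOT ROOT ROOT ROOT -1 ROOT ROOT is be VBZ VB 1",
    "is be VBZ VB 1 verb_arg12 ARG1 Enju enju NNP NNP 0",
    "is be VBZ VB 1 verb_arg12 ARG2 parser parser NN NN 5",
    "an an DT DT 2 det_arg1 ARG1 parser parser NN NN 5",
    "efficient efficient JJ JJ 3 adj_arg1 ARG1 parser parser NN NN 5",
    "HPSG hpsg NNP NNP 4 noun_arg1 ARG1 parser parser NN NN 5",
])

for rel in parse_sentence(SAMPLE):
    print(f"{rel.predicate.surface} --{rel.label}--> {rel.argument.surface}")
```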
The label of a relation is one of "MOD", "ARG1", ..., "ARG4". "ARG1" denotes the subject of a verb, the target modified by a modifier (such as an adjective or a preposition), etc. "ARG2" denotes the object of a verb, a preposition, etc. The other "ARGx" labels denote further objects and complements of verbs, etc. "MOD" is used for participial constructions etc. It denotes a clause modified by another clause, such as a verbal or an adverbial clause, and the subordinate clause has its own ARG1.
A predicate word is not always a head word. In the above example, the head word of "an efficient HPSG parser" is "parser", but "parser" is an argument of the predicate "efficient". This is because the relation between "parser" and "efficient" is represented in the same way as in the sentence "An HPSG parser is efficient", that is, with the same predicate-argument relation.
When the parsing of the whole sentence fails, fragmental parse results are output. In this case, multiple "ROOT" lines are output because a ROOT line is output for each fragmental parse result. Note that predicate-argument relations do not necessarily form a connected graph.
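Because each fragmental parse result contributes its own ROOT line, counting ROOT lines is one way to tell a full parse from a fragmental one. The sketch below assumes the 12-column layout of the example output; the fragmental sample is hypothetical and was written by hand for illustration, it is not real parser output.

```python
from typing import List, Tuple


def root_words(block: str) -> List[Tuple[str, int]]:
    """Return (word, position) for every ROOT line in one sentence's output."""
    roots = []
    for line in block.splitlines():
        cols = line.split("\t")
        # A ROOT line has "ROOT" as the predicate and as the relation label.
        if len(cols) == 12 and cols[0] == "ROOT" and cols[6] == "ROOT":
            roots.append((cols[7], int(cols[11])))
    return roots


def is_fragmental(block: str) -> bool:
    """More than one ROOT line means the output consists of fragments."""
    return len(root_words(block)) > 1


# Hypothetical output with two fragments (hand-written, not produced by Enju).
FRAGMENTS = "\n".join("\t".join(row.split()) for row in [
    "ROOT ROOT ROOT ROOT -1 ROOT ROOT Enju enju NNP NNP 0",
    "ROOT ROOT ROOT ROOT -1 ROOT ROOT parser parser NN NN 5",
    "efficient efficient JJ JJ 3 adj_arg1 ARG1 parser parser NN NN 5",
])

print(root_words(FRAGMENTS), is_fragmental(FRAGMENTS))
```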
If parsing fails completely, the parser shows "Parsing failure" and its reason.
Enju also supports output in XML and stand-off formats. Parse results are output in the XML format when the "-xml" option is specified, and in the stand-off format when the "-so" option is specified. These formats represent not only predicate-argument relations but also phrase structures.
In the XML format, phrase structure and predicate-argument structure are printed with XML tags and their attributes. Each sentence is output on a single line. The following example is the output of parsing "Enju is an efficient HPSG parser." (the actual output is in one line).
<sentence id="s0" parse_status="success"><cons id="c0" cat="S" xcat="" head="c3" sem_head="c3" schema="subj_head"><cons id="c1" cat="NP" xcat="" head="c2" sem_head="c2" schema="empty_spec_head"> <cons id="c2" cat="NX" xcat="" head="t0" sem_head="t0"><tok id="t0" cat="N" pos="NNP" base="enju" lexentry="[D<N.3sg>]_lxm" pred="noun_arg0">Enju</tok></cons></cons> <cons id="c3" cat="VP" xcat="" head="c4" sem_head="c4" schema="head_comp"><cons id="c4" cat="VX" xcat="" head="t1" sem_head="t1"><tok id="t1" cat="V" pos="VBZ" base="be" tense="present" aspect="none" voice="active" aux="minus" lexentry="[NP.nom<V.cpl.bse>NP.acc]_lxm-singular3rd_verb_rule" pred="verb_arg12" arg1="c1" arg2="c5">is</tok></cons> <cons id="c5" cat="NP" xcat="" head="c7" sem_head="c7" schema="spec_head"><cons id="c6" cat="DP" xcat="" head="t2" sem_head="t2"><tok id="t2" cat="D" pos="DT" base="an" lexentry="[<D>]N" pred="det_arg1" arg1="c7">an</tok></cons> <cons id="c7" cat="NX" xcat="" head="c9" sem_head="c9" schema="mod_head"><cons id="c8" cat="ADJP" xcat="" head="t3" sem_head="t3"><tok id="t3" cat="ADJ" pos="JJ" base="efficient" lexentry="[<ADJP.adj>]N" pred="adj_arg1" arg1="c9">efficient</tok></cons> <cons id="c9" cat="NX" xcat="" head="c11" sem_head="c11" schema="mod_head"><cons id="c10" cat="NP" xcat="" head="t4" sem_head="t4"><tok id="t4" cat="N" pos="NNP" base="hpsg" lexentry="[D<N.3sg>]_lxm-noun_adjective_rule" pred="noun_arg1" arg1="c11">HPSG</tok></cons> <cons id="c11" cat="NX" xcat="" head="t5" sem_head="t5"><tok id="t5" cat="N" pos="NN" base="parser" lexentry="[D<N.3sg>]_lxm" pred="noun_arg0">parser</tok></cons> </cons></cons></cons></cons></cons>.</sentence>
The entire sentence is bracketed by <sentence>. When the parsing has succeeded, the "parse_status" attribute will be "success". Phrase structures are represented with <cons>. A constituent is bracketed by <cons>, and the attribute "cat" represents the phrase symbol of the constituent. For example, a noun phrase, "an efficient HPSG parser", is represented as "<cons cat="NP">an efficient HPSG parser</cons>".
Each word is bracketed by <tok>. The attributes "pos" and "base" represent a part-of-speech and a base form. "cat" represents the same information as in <cons>.
ID numbers (unique in an input file) are assigned to all "cons" and "tok" elements and are given in the attribute "id". Each "cons" tag has the attribute "head", which gives the ID of the syntactic head daughter of the phrase, and the attribute "sem_head", which gives the ID of the semantic head of the phrase. For example, the head of an auxiliary verb phrase is the auxiliary verb, while its semantic head is the main verb.
Predicate-argument relations of words are represented with the attributes "mod", "arg1", ..., "arg4" in "tok". A predicate word has some of the above attributes, each of which represents the ID number of an argument phrase. In the above example, the "tok" tag for "is" has arg1="c1" arg2="c5", and they represent the ID numbers of "Enju" and "an efficient HPSG parser", respectively.
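As an illustration of how the "arg1" ... "arg4" and "mod" attributes point back to phrases, the following sketch resolves each argument ID to the text of the phrase it names, using Python's standard `xml.etree.ElementTree`. The sample is the example sentence with some attributes ("xcat", "schema", "lexentry") left out for brevity, so it is an abridged input rather than verbatim Enju output; the function and variable names are illustrative.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

ARG_ATTRS = ("mod", "arg1", "arg2", "arg3", "arg4")

# Abridged version of the example output (attributes trimmed for brevity).
SAMPLE_XML = (
    '<sentence id="s0" parse_status="success">'
    '<cons id="c0" cat="S" head="c3" sem_head="c3">'
    '<cons id="c1" cat="NP" head="c2" sem_head="c2">'
    '<cons id="c2" cat="NX" head="t0" sem_head="t0">'
    '<tok id="t0" cat="N" pos="NNP" base="enju" pred="noun_arg0">Enju</tok>'
    '</cons></cons> '
    '<cons id="c3" cat="VP" head="c4" sem_head="c4">'
    '<cons id="c4" cat="VX" head="t1" sem_head="t1">'
    '<tok id="t1" cat="V" pos="VBZ" base="be" pred="verb_arg12"'
    ' arg1="c1" arg2="c5">is</tok></cons> '
    '<cons id="c5" cat="NP" head="c7" sem_head="c7">'
    '<cons id="c6" cat="DP" head="t2" sem_head="t2">'
    '<tok id="t2" cat="D" pos="DT" base="an" pred="det_arg1" arg1="c7">an</tok>'
    '</cons> '
    '<cons id="c7" cat="NX" head="c9" sem_head="c9">'
    '<cons id="c8" cat="ADJP" head="t3" sem_head="t3">'
    '<tok id="t3" cat="ADJ" pos="JJ" base="efficient" pred="adj_arg1"'
    ' arg1="c9">efficient</tok></cons> '
    '<cons id="c9" cat="NX" head="c11" sem_head="c11">'
    '<cons id="c10" cat="NP" head="t4" sem_head="t4">'
    '<tok id="t4" cat="N" pos="NNP" base="hpsg" pred="noun_arg1"'
    ' arg1="c11">HPSG</tok></cons> '
    '<cons id="c11" cat="NX" head="t5" sem_head="t5">'
    '<tok id="t5" cat="N" pos="NN" base="parser" pred="noun_arg0">parser</tok>'
    '</cons> '
    '</cons></cons></cons></cons></cons>.</sentence>'
)


def phrase_text(elem: ET.Element) -> str:
    """Text covered by an element, with whitespace normalized."""
    return " ".join("".join(elem.itertext()).split())


def predicate_arguments(xml_line: str) -> List[Tuple[str, str, str]]:
    """Return (predicate word, attribute name, argument phrase text) triples."""
    sentence = ET.fromstring(xml_line)
    by_id = {e.get("id"): e for e in sentence.iter() if e.get("id")}
    relations = []
    for tok in sentence.iter("tok"):
        for attr in ARG_ATTRS:
            target = tok.get(attr)
            # Skip attributes that are absent or do not name a known element.
            if target in by_id:
                relations.append((tok.text, attr, phrase_text(by_id[target])))
    return relations


for pred, attr, arg in predicate_arguments(SAMPLE_XML):
    print(f"{pred}: {attr} = {arg!r}")
```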
When the parsing of the whole sentence fails, fragmental parse results are output. In this case, the attribute "parse_status" will be "fragmental parse". Multiple <cons> elements are output directly under <sentence>, and each of them represents a fragmental parse result.
When parsing has failed, the "parse_status" attribute of <sentence> gives the reason for the failure. Neither <cons> nor <tok> is output.
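The three outcomes described above (success, fragmental parse, failure) can be told apart from the "parse_status" attribute alone. A minimal sketch follows; the sample lines are hand-written and heavily abridged, and the failure reason string in particular is a made-up placeholder, since the actual reason strings are not listed here.

```python
import xml.etree.ElementTree as ET
from typing import Tuple, Union


def classify(xml_line: str) -> Tuple[str, Union[int, str]]:
    """Classify one <sentence> line.

    Returns ("success", 1), ("fragmental", number of fragments),
    or ("failure", reason string taken from parse_status).
    """
    sentence = ET.fromstring(xml_line)
    status = sentence.get("parse_status", "")
    # Fragments are the <cons> elements directly under <sentence>.
    top_level = sentence.findall("cons")
    if status == "success":
        return ("success", len(top_level))
    if status == "fragmental parse":
        return ("fragmental", len(top_level))
    return ("failure", status)


# Hand-written, abridged samples (not verbatim Enju output).
OK = ('<sentence id="s0" parse_status="success">'
      '<cons id="c0" cat="S"><tok id="t0" cat="N">Enju</tok></cons></sentence>')
FRAG = ('<sentence id="s1" parse_status="fragmental parse">'
        '<cons id="c0" cat="NP"><tok id="t0" cat="N">Enju</tok></cons> '
        '<cons id="c1" cat="NP"><tok id="t1" cat="N">parser</tok></cons></sentence>')
FAIL = '<sentence id="s2" parse_status="hypothetical failure reason"></sentence>'

for line in (OK, FRAG, FAIL):
    print(classify(line))
```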
In the XML format, more syntactic/semantic information is output. For details, see "Enju Output Specifications".
In the stand-off format, the span of each tag is represented with the position in the original input sentence. Each line represents a tag. The above XML-format output is represented with the following stand-off format.
0	33	sentence id="s0" parse_status="success"
0	32	cons id="c0" cat="S" xcat="" head="c3" sem_head="c3" schema="subj_head"
0	4	cons id="c1" cat="NP" xcat="" head="c2" sem_head="c2" schema="empty_spec_head"
0	4	cons id="c2" cat="NX" xcat="" head="t0" sem_head="t0"
0	4	tok id="t0" cat="N" pos="NNP" base="enju" lexentry="[D<N.3sg>]_lxm" pred="noun_arg0"
5	32	cons id="c3" cat="VP" xcat="" head="c4" sem_head="c4" schema="head_comp"
5	7	cons id="c4" cat="VX" xcat="" head="t1" sem_head="t1"
5	7	tok id="t1" cat="V" pos="VBZ" base="be" tense="present" aspect="none" voice="active" aux="minus" lexentry="[NP.nom<V.cpl.bse>NP.acc]_lxm-singular3rd_verb_rule" pred="verb_arg12" arg1="c1" arg2="c5"
8	32	cons id="c5" cat="NP" xcat="" head="c7" sem_head="c7" schema="spec_head"
8	10	cons id="c6" cat="DP" xcat="" head="t2" sem_head="t2"
8	10	tok id="t2" cat="D" pos="DT" base="an" lexentry="[<D>]N" pred="det_arg1" arg1="c7"
11	32	cons id="c7" cat="NX" xcat="" head="c9" sem_head="c9" schema="mod_head"
11	20	cons id="c8" cat="ADJP" xcat="" head="t3" sem_head="t3"
11	20	tok id="t3" cat="ADJ" pos="JJ" base="efficient" lexentry="[<ADJP.adj>]N" pred="adj_arg1" arg1="c9"
21	32	cons id="c9" cat="NX" xcat="" head="c11" sem_head="c11" schema="mod_head"
21	25	cons id="c10" cat="NP" xcat="" head="t4" sem_head="t4"
21	25	tok id="t4" cat="N" pos="NNP" base="hpsg" lexentry="[D<N.3sg>]_lxm-noun_adjective_rule" pred="noun_arg1" arg1="c11"
26	32	cons id="c11" cat="NX" xcat="" head="t5" sem_head="t5"
26	32	tok id="t5" cat="N" pos="NN" base="parser" lexentry="[D<N.3sg>]_lxm" pred="noun_arg0"
Elements of a line are separated by tabs. The first and the second columns represent the start and the end position, respectively. Positions are absolute positions from the beginning of the input file. The last column represents the content of a tag. The label of a tag (e.g. "cons" and "tok") is output first, and the rest represents the attributes. The tags and attributes carry the same information as in the XML format.
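A stand-off line can be split into its span, tag label, and attributes, and the span can then be used to slice the original input. The sketch below assumes what the example suggests: offsets are character offsets and the end position is exclusive (for instance "Enju" spans 0 to 4). The names `Tag`, `parse_standoff_line`, and `covered_text` are illustrative only.

```python
import re
from typing import Dict, NamedTuple

# Matches name="value" pairs; values may be empty or contain spaces.
ATTRIBUTE = re.compile(r'([\w:-]+)="([^"]*)"')


class Tag(NamedTuple):
    start: int
    end: int
    label: str
    attrs: Dict[str, str]


def parse_standoff_line(line: str) -> Tag:
    """Split one stand-off line into span, tag label, and attributes."""
    start, end, content = line.strip().split(None, 2)
    parts = content.split(None, 1)
    attr_text = parts[1] if len(parts) > 1 else ""
    return Tag(int(start), int(end), parts[0], dict(ATTRIBUTE.findall(attr_text)))


def covered_text(tag: Tag, text: str) -> str:
    """Return the part of the input covered by a tag (end is exclusive)."""
    return text[tag.start:tag.end]


TEXT = "Enju is an efficient HPSG parser."
LINES = [
    '0\t33\tsentence id="s0" parse_status="success"',
    '8\t32\tcons id="c5" cat="NP" xcat="" head="c7" sem_head="c7" schema="spec_head"',
    '5\t7\ttok id="t1" cat="V" pos="VBZ" base="be" pred="verb_arg12" arg1="c1" arg2="c5"',
]

for tag in map(parse_standoff_line, LINES):
    print(tag.label, tag.attrs.get("id"), repr(covered_text(tag, TEXT)))
```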
You can access Enju over a network by running it as an HTTP server. Run "enju -cgi port_number", and Enju works as a CGI server. Access "/cgi-lilfes/enju?" on the specified port, pass a sentence as the CGI argument "sentence", and you will get a parse result in the XML format.
For example, when you run "enju -cgi 10000" on "localhost" and access:
http://localhost:10000/cgi-lilfes/enju?sentence=Enju+is+an+efficient+HPSG+parser.
you will get the XML output shown above.
When you access "/cgi-lilfes/enju?" without any arguments, a simple HTML form is returned. You can try the CGI server with a web browser that supports XHTML, JavaScript and XSLT (e.g. Firefox).
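The same request can be issued from a program. The following sketch builds the query URL with Python's standard library and fetches the result; the host, port, timeout, and the assumption that the response is UTF-8 text are illustrative choices, not settings documented here. Only `enju_url` is exercised below, since `parse_remote` needs a running server.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def enju_url(sentence: str, host: str = "localhost", port: int = 10000) -> str:
    """Build the CGI URL for one sentence (spaces are encoded as '+')."""
    query = urlencode({"sentence": sentence})
    return f"http://{host}:{port}/cgi-lilfes/enju?{query}"


def parse_remote(sentence: str, host: str = "localhost", port: int = 10000,
                 timeout: float = 30.0) -> str:
    """Send one sentence to a running 'enju -cgi' server and return the XML.

    Assumes the server answers with UTF-8 text; adjust if it does not.
    """
    with urlopen(enju_url(sentence, host, port), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(enju_url("Enju is an efficient HPSG parser."))
```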
Enju accepts the following options and command-line arguments.
enju [options] [-a arguments]
Arguments following "-a" are passed to LiLFeS programs as command-line arguments.
Option | Description |
---|---|
-h | Show help message |
-hh | Show detailed help message |
-D directory | Specify the directory of the Enju grammar |
-L directory | Specify the directory of LiLFeS modules (the directory is added to the beginning of "LILFES_PATH".) |
-t tagger | Specify a POS tagger |
-m stemmer | Specify a stemmer |
-nt | Disable a POS tagger |
-d | Output in predicate-argument relation format |
-xml | Output in XML format |
-so | Output in stand-off format |
-cgi port_number | Start CGI server |
-genia | Use a parsing model for the biomedical domain |
-brown | Use a parsing model for the literature domain |
-A | Allow ambiguous POS tagging (improves parsing accuracy, but lowers parsing speed) |
-N number | Output N-best results |
-W number | Limit sentence length |
-E number | Limit number of edges |
-C number | Limit size of large constituents |
-l module | Load LiLFeS program |
-e command | Execute LiLFeS command |
-i | Go into interactive mode (show lilfes prompt) |
-n | Non-interactive mode |
When LiLFeS modules are specified with "-l", they are loaded into the parser. When LiLFeS commands are specified with "-e", Enju executes them. With the "-i" option, Enju shows a lilfes prompt and waits for LiLFeS programs to be entered. "Ctrl-D" ends the interactive mode.
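When Enju is driven from a script, the options above translate into an argument list. The sketch below only uses flags from the table ("-d", "-xml", "-so", "-genia", "-brown", "-N"); the helper names are illustrative, and `parse_sentences` additionally assumes that "enju" is on the PATH, reads one sentence per line from standard input, and exits at end of input, which should be checked against your installation.

```python
import subprocess
from typing import List, Optional, Sequence

FORMAT_FLAGS = {"pas": "-d", "xml": "-xml", "so": "-so"}
MODEL_FLAGS = {"genia": "-genia", "brown": "-brown"}


def build_command(output: str = "xml", model: Optional[str] = None,
                  nbest: Optional[int] = None,
                  executable: str = "enju") -> List[str]:
    """Translate a few common choices into an enju argument list."""
    command = [executable, FORMAT_FLAGS[output]]
    if model is not None:
        command.append(MODEL_FLAGS[model])
    if nbest is not None:
        command += ["-N", str(nbest)]
    return command


def parse_sentences(sentences: Sequence[str], **options) -> str:
    """Run the parser on the given sentences and return its standard output.

    Assumption: the parser reads stdin line by line and stops at EOF.
    """
    result = subprocess.run(
        build_command(**options),
        input="\n".join(sentences) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


print(build_command("xml", model="genia", nbest=3))
```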
When you have installed grammar data and/or LiLFeS modules in non-default directories, you need to set the following environment variables to tell Enju where they are installed. Environment variables are overridden by the corresponding command-line options.
Variable | Description |
---|---|
ENJU_PREFIX | Specify the directory where Enju is installed (affects the places for the grammar data, default POS tagger and stemmer) |
ENJU_DIR | Specify the directory of the Enju grammar (corresponding to "-D") |
ENJU_TAGGER | Specify a POS tagger (corresponding to "-t") |
ENJU_MORPH | Specify a stemmer (corresponding to "-m") |
LILFES_PATH | Specify search paths of LiLFeS modules (corresponding to "-L") |
LD_LIBRARY_PATH | Specify the directory of liblilfes |
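If Enju is launched from a script, these variables can be set for the child process only, without changing the calling environment. A minimal sketch, in which the function name and the paths in the usage line are hypothetical placeholders:

```python
import os
from typing import Dict, Mapping, Optional


def enju_environment(grammar_dir: Optional[str] = None,
                     tagger: Optional[str] = None,
                     stemmer: Optional[str] = None,
                     base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with Enju variables filled in.

    The result can be passed as env= to subprocess.run; os.environ itself
    is left untouched.
    """
    env = dict(os.environ if base is None else base)
    if grammar_dir is not None:
        env["ENJU_DIR"] = grammar_dir      # corresponds to "-D"
    if tagger is not None:
        env["ENJU_TAGGER"] = tagger        # corresponds to "-t"
    if stemmer is not None:
        env["ENJU_MORPH"] = stemmer        # corresponds to "-m"
    return env


# Hypothetical paths, for illustration only.
print(enju_environment(grammar_dir="/opt/enju/DATA", base={"PATH": "/usr/bin"}))
```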