Add New Languages
There are three different ways in which byexample can be extended:
- define zones where to find examples
- support new languages: how to find them and how to run them
- perform arbitrary actions during the execution
byexample uses the concept of modules: a Python file that defines some extension
classes. Modules can be loaded using --modules <dir>
from the command line.
Which extension classes you need depends on what you want to extend or customize.
In this how-to we will see how to add a new language.
Check how to define new zones where to find examples
and how to hook to events with concerns
for a how-to about the first and last items.
Imagine that we want to write examples in the mythical language ArnoldC,
a programming language whose instructions are phrases from a famous
actor.
What do we need?
How to find examples: the Finder
The first thing to teach byexample is how to find an ArnoldC
example.
Most of the languages supported by byexample use a prompt to mark
the beginning of an example.
But just for fun, let’s imagine that we want to do something different.
Let’s say that our examples are enclosed by the ~~~ strings: anything
between two ~~~ will be considered an ArnoldC example.
Here is what I mean
This is an example which begins here
    ~~~
    IT'S SHOWTIME # byexample: +awesome
    TALK TO THE HAND "Hello World!"
    YOU HAVE BEEN TERMINATED
    out:
    Hello World!
    ~~~
The code above should produce the famous 'Hello World!' output
Notice how below the code there is an out: tag. We will use this to
separate the code from the expected output.
Find the snippet of code
To accomplish this we need a regular expression that finds the
~~~ markers, the snippet of code and the expected output.
>>> from byexample import regex as re
>>> example_re = re.compile(r'''
... # begin with ~~~
... ^[ ]* ~~~ [ ]*\n
...
... # grab everything until the 'out:' string
... # this will be our snippet of code
... (?P<snippet>
... (?:^(?P<indent> [ ]*)[^ ] .*) # first line: learn what is
... # the level of indentation
...
... (?:\n # grab everything else...
... (?![ ]*out:[ ]*\n) # except if the line starts
... # with out:
... (?![ ]*~~~) # or with ~~~
...
... .*)*) # anything else is welcome
... \n?
...
... # now, if we find 'out:', grab the expected output
... # this part of the regex is optional because not all the examples
... # output something to compare with.
... (?: [ ]* out:[ ]*\n
... (?P<expected> (?:
... (?![ ]*~~~) # except a ~~~ line,
... .+$\n? # grab everything!
... )*)
... )?
...
... # finally, the end marker
... ^[ ]* ~~~ [ ]*$
...
... ''', re.MULTILINE | re.VERBOSE)
The capture groups snippet, indent and expected are mandatory.
Any of them may match an empty string, but all three must be defined.
The first should match the executable code, and the last the expected output (if any) to compare against.
The indent group counts how many leading spaces are not part of the example
and are there only for indentation: byexample will drop the first line that
has a lower level of indentation and any subsequent line.
Changed in byexample 10.0.0. Before 10.0.0 you could return a Python regular expression, but from 10.0.0 on you need to return a regular expression created by byexample.regex. The module is almost identical to Python’s re, so the required changes are minimal.
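As a rough illustration of what the three named groups end up holding, here is a simplified sketch. It uses Python's standard re module instead of byexample.regex (an assumption, only so it runs without byexample installed) and a much shorter pattern than the one above. The names block_re, marker and sample are hypothetical, and the indentation is removed with textwrap.dedent, which is not how byexample itself uses the indent group.

```python
import re
import textwrap

# Simplified stand-in for the finder regex; NOT the pattern byexample uses.
block_re = re.compile(r'''
    ^(?P<indent>[ ]*)~~~[ ]*\n      # opening marker, remember its indentation
    (?P<snippet>.*?)                # the code (as little as possible)
    (?:^[ ]*out:[ ]*\n              # optional "out:" separator line
       (?P<expected>.*?)            # everything after it is the expected output
    )?
    ^[ ]*~~~[ ]*$                   # closing marker
''', re.MULTILINE | re.VERBOSE | re.DOTALL)

marker = "~" * 3   # built from pieces so this listing has no literal marker line
sample = "\n".join([
    "This is an example which begins here",
    "    " + marker,
    "    IT'S SHOWTIME",
    '    TALK TO THE HAND "Hello World!"',
    "    YOU HAVE BEEN TERMINATED",
    "",
    "    out:",
    "    Hello World!",
    "    " + marker,
    "",
])

m = block_re.search(sample)
indent = m.group("indent")
# The expected group is None when the block has no "out:" section.
snippet = textwrap.dedent(m.group("snippet")).strip()
expected = textwrap.dedent(m.group("expected") or "").strip()

print(len(indent))
print(snippet)
print(expected)
```

The optional group is what lets a block without an out: line still match; in that case expected is simply empty.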
Detect the language
Then, the finder needs to determine in which language the example was written.
For our purposes let’s say that anything between ~~~ is always an
ArnoldC example but the same finder could find examples of
different languages.
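If one finder had to serve several languages, it could look at the matched text to decide. The sketch below shows one possible heuristic as a plain function; guess_language and the rule it applies are hypothetical and are not part of byexample.

```python
def guess_language(snippet):
    """Hypothetical heuristic: ArnoldC programs open with IT'S SHOWTIME."""
    for line in snippet.splitlines():
        stripped = line.strip()
        if stripped:
            # Only the first non-blank line is inspected.
            return "ArnoldC" if stripped.startswith("IT'S SHOWTIME") else None
    # Empty snippet: this sketch cannot tell.
    return None


print(guess_language("IT'S SHOWTIME\nYOU HAVE BEEN TERMINATED"))
print(guess_language("print('hello')"))
```

Returning None for "unknown" is a convention of this sketch only.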
The Finder class
Now we assemble all the pieces.
We need to create a class, inherit from ExampleFinder,
define a target attribute and implement a few methods:
>>> from byexample.finder import ExampleFinder
>>> class ArnoldCFinder(ExampleFinder):
...     target = 'ArnoldC-session'
...
...     def example_regex(self):
...         global example_re
...         return example_re   # defined above
...
...     def get_language_of(self, options, match, where):
...         return 'ArnoldC'
...
...     def spurious_endings(self):
...         # module-level helper, not defined in this how-to
...         return spurious_endings(self)
The target attribute may need a little explanation. All the
Finders must declare which type of examples they are targeting.
If two finders try to find the same target, one will override the other.
This is useful if you want to use a different Finder in place of
an existing one: just create a class with the same target.
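A toy model may help to picture the override. The sketch below keeps finders in a dictionary keyed by target, so two classes with the same target cannot coexist. The classes, the register function and the "last one wins" rule are all assumptions of this sketch; how byexample actually resolves the conflict is not shown here.

```python
class FinderA:
    target = "ArnoldC-session"


class FinderB:
    target = "ArnoldC-session"   # same target as FinderA


class OtherFinder:
    target = "other-session"


def register(finders):
    """Toy registry keyed by target: a later finder replaces an earlier one."""
    registry = {}
    for finder in finders:
        registry[finder.target] = finder
    return registry


registry = register([FinderA, OtherFinder, FinderB])
print(registry["ArnoldC-session"].__name__)
print(len(registry))
```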
Let’s see if our finder can find the ArnoldC snippet above.
>>> from byexample.cfg import Config
>>> finder = ArnoldCFinder(cfg=Config(verbosity=0, encoding='utf-8'))
>>> filepath = 'docs/contrib/how-to-support-new-finders-and-languages.md'
>>> where = (0,1,filepath,None)
>>> matches = finder.get_matches(open(filepath, 'r').read())
>>> matches = list(matches)
>>> len(matches)
1
>>> match = matches[0]
>>> indent = match.group('indent')
>>> len(indent)
4
>>> snippet, expected = finder.get_snippet_and_expected(match, where)
>>> print(snippet)
IT'S SHOWTIME # byexample: +awesome
TALK TO THE HAND "Hello World!"
YOU HAVE BEEN TERMINATED
>>> print(expected)
Hello World!
By default, get_snippet_and_expected gets the snippet and the
expected groups from the match, but you can extend it to post-process
the strings.
Take a look at the implementation of PythonFinder
(in byexample/modules/python.py).
The PythonFinder will find and match Python examples that start with
the prompt >>>; later, it extends get_snippet_and_expected to remove
the prompts from the snippet and return valid Python code.
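To give an idea of that kind of post-processing, here is a naive sketch that removes interpreter prompts from a snippet using the standard re module. strip_prompts is a hypothetical helper written for this how-to; it is not the code that PythonFinder uses.

```python
import re


def strip_prompts(snippet):
    """Drop a leading '>>>' or '...' prompt (and one space) from every line."""
    return re.sub(r'^(?:>>>|\.\.\.) ?', '', snippet, flags=re.MULTILINE)


raw = ">>> for i in range(2):\n...     print(i)"
clean = strip_prompts(raw)
print(clean)

# The result is valid Python again, so it can be compiled.
compile(clean, "<snippet>", "exec")
```

Only one space after the prompt is removed, so the indentation of the loop body survives.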
How to support new languages: the Parser and the Runner
To support new languages we need to be able to parse the code in the first place and to execute it later.
Now that we have a raw snippet from the Finder we need to polish it and
extract the options that byexample uses to customize the example.
Get the options
The options can be of any form and be in any place.
Typically we can write the options in the comments of the code, which obviously will depend on the language.
If the comments in ArnoldC start with a #, we can say that every comment
that starts with byexample is a comment that will contain options.
This regular expression should capture that:
>>> from byexample import regex as re
>>> opts_string_re = re.compile(r'#\s*byexample:\s*([^\n\'"]*)$',
... re.MULTILINE)
The unnamed group should capture the option or options; how to extract each individual option is a task for a more complex parser than a simple regex.
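To see what the unnamed group captures, the sketch below applies the same pattern with Python's standard re module (an assumption, so it runs without byexample) to a made-up line that carries two options. demo_opts_re and source are hypothetical names, and splitting on whitespace is only a naive stand-in for the real option parser.

```python
import re

# Same pattern as above, compiled with the standard library for illustration.
demo_opts_re = re.compile(r'#\s*byexample:\s*([^\n\'"]*)$', re.MULTILINE)

source = (
    "IT'S SHOWTIME # byexample: +awesome +skip\n"
    'TALK TO THE HAND "Hello World!"'
)

match = demo_opts_re.search(source)
options_string = match.group(1)   # everything after "byexample:"
print(options_string)

# Naive split; a real parser also has to handle arguments and quoting.
print(options_string.split())
```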
byexample will create a parser for us to parse all the common options (the
ones that byexample supports by default).
It is our job to extend this parser, adding more flags or arguments to parse our
own options (the ones specific to ArnoldC).
>>> def extend_option_parser(parser):
...     parser.add_flag("awesome")
See the documentation of the class OptionParser for more information.
Note: the extend_option_parser in theory is called once for each example. However, byexample calls it as few times as possible for performance reasons. The extraction and parsing of the options are cached: if two examples have the same options, they are parsed once and therefore the parser itself is extended just once.
If you need to tweak or disable this you could override methods of the class ExampleParser.
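The effect of such a cache can be pictured with plain memoization. This is only a sketch of the idea using functools.lru_cache; parse_options and calls are hypothetical, and byexample's own caching may work differently.

```python
from functools import lru_cache

calls = []   # records every time the "expensive" parsing really happens


@lru_cache(maxsize=None)
def parse_options(options_string):
    """Hypothetical parser: returns the flags found in the string."""
    calls.append(options_string)
    return tuple(options_string.split())


for opts in ["+awesome", "+awesome", "+skip", "+awesome"]:
    parse_options(opts)

# Four examples, but only two distinct option strings were parsed.
print(calls)
```

This is also why any side effect placed in the parsing code cannot be counted on to run once per example.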
The Parser class
Now we assemble all the pieces.
We need to create a class, inherit from ExampleParser,
define a language attribute and implement the missing methods:
>>> from byexample.parser import ExampleParser
>>> class ArnoldCParser(ExampleParser):
...     language = 'arnold'
...
...     def example_options_string_regex(self):
...         global opts_string_re
...         return opts_string_re
...
...     def extend_option_parser(self, parser):
...         return extend_option_parser(parser)
The user can select which languages should be parsed and executed and which
should not from the command line with the flag -l.
So we need to declare what language is our Parser for: that’s the reason
behind the language attribute.
Optionally you can define the flavors attribute: a set of
flavors of your language that you support and that the user
can select with -l (to parse the example in a different way, perhaps).
Let’s create the example (in practice this is done by byexample behind
the scenes so you do not need to worry about the details):
>>> from byexample.options import Options, OptionParser
>>> parser = ArnoldCParser(cfg=Config(verbosity=0, encoding='utf-8', options=Options(rm=[], norm_ws=False, tags=True, capture=True, type=False, input_prefix_range=(6,12), optparser=OptionParser(add_help=False))))
>>> from byexample.finder import Example
>>> runner = None # not yet
>>> example = Example(finder, runner, parser,
... snippet, expected, indent, where)
At this point the example is incomplete: neither its source code nor its options have been extracted from the snippet yet.
>>> example.source
<...>
AttributeError: 'Example' object has no attribute 'source'
>>> example.options
<...>
AttributeError: 'Example' object has no attribute 'options'
These attributes are filled in by the parser, which is the only one that knows how to extract the source and the options from a raw example because that is a language-specific task.
>>> from byexample.log import init_log_system
>>> init_log_system() # needed because parse_yourself requires the log system
>>> example = example.parse_yourself()
>>> print(example.source)
IT'S SHOWTIME # byexample: +awesome
TALK TO THE HAND "Hello World!"
YOU HAVE BEEN TERMINATED
>>> print(example.expected.str)
Hello World!
>>> print(example.options)
{'awesome': True}
The process_snippet_and_expected method can be extended to make
last-minute changes to the snippet and the expected strings, after the parsing of the
options.
>>> hasattr(ExampleParser, 'process_snippet_and_expected')
True
See GDBParser in byexample/modules/gdb.py.
The implementation extends this method to remove any comment from the snippet
because GDB doesn’t support them.
Another useful example is PythonParser in byexample/modules/python.py.
It heavily modifies the expected string to support a compatibility mode with doctest.
TL;DR - Options from the command line and from the examples
byexample allows the user to pass options from the command line via
-o. These options are the same that you can set on each example but
they affect the whole execution.
Following our +awesome example we could call byexample like:
$ byexample -l arnold -o=+awesome somefile.md # byexample: +skip
Parsing the options from the command line is done by
ExampleParser.extract_cmdline_options while the parsing from each
example is done by ExampleParser.extract_options.
In general you don’t need to worry about those two: byexample
uses the option parser that you extended in extend_option_parser
in a reasonable manner.
Only if you want to do something more sophisticated you may want to
override those two methods. Just remember to call the parent method
and keep in mind that while the parsing of the options from the command
line happens in the main thread, the rest of the parsing happens on
each worker (this is relevant only if you need to share information between
them, in which case you need to share the data safely).
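One way to picture how the two sources of options combine is a layered lookup. The sketch below uses collections.ChainMap and assumes that a value set on an example shadows the one given on the command line; that precedence is an assumption of this sketch and byexample's own option handling is more elaborate. The dictionaries are hypothetical.

```python
from collections import ChainMap

cmdline_options = {"awesome": True, "norm_ws": False}   # e.g. parsed from -o
example_options = {"norm_ws": True}                     # parsed from one example

# Assumption: per-example values are looked up first, then the command line.
effective = ChainMap(example_options, cmdline_options)

print(effective["awesome"])   # only set on the command line
print(effective["norm_ws"])   # set in both places, the example's value is used
```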
The Runner class
The Runner is the one that executes the code.
Most of the time it is a proxy to a real interpreter, but it can be a mix of compiler and runner depending on the underlying language.
To see how this ‘proxy’ class can interact with another program, check the
implementation of the Python and Ruby Interpreters of byexample in
byexample/modules/python.py and
byexample/modules/ruby.py
For our case, we will implement a small toy-interpreter in Python itself so
you do not need to install a real ArnoldC compiler.
>>> from byexample import regex as re
>>> def toy_arnoldc_interpreter(source_code):
...     output = []
...     for line in source_code.split('\n'):
...         if line.startswith("TALK TO THE HAND"):
...             to_print = re.search(r'"([^"]*)"', line).group(1)
...             output.append(to_print + '\n')
...
...     # each entry already ends with a newline
...     return ''.join(output)
Now we assemble the ExampleRunner subclass:
>>> from byexample.runner import ExampleRunner
>>> class ArnoldCRunner(ExampleRunner):
...     language = 'arnold'
...
...     def run(self, example, options):
...         return toy_arnoldc_interpreter(example.source)
...
...     def initialize(self, examples, options):
...         pass
...
...     def shutdown(self):
...         pass
The initialize and shutdown methods are called before and after the
execution of all the tests. They can be used to set up the real interpreter
or to perform some off-line task (like compiling).
If possible, try to implement the cancel method to cancel an ongoing
example and support the recovery after a timeout.
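When the runner delegates to an external program, recovering from a timeout usually means stopping that program. The sketch below does not use the cancel hook itself; it only illustrates the idea with the standard subprocess module, running a child Python process and giving up after a deadline. run_with_timeout is a hypothetical helper.

```python
import subprocess
import sys


def run_with_timeout(code, timeout):
    """Run code in a child Python process; return its output or None on timeout."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing lingers.
        return None
    return done.stdout


fast = run_with_timeout("print('Hello World!')", timeout=30)
slow = run_with_timeout("import time; time.sleep(60)", timeout=0.5)
print(repr(fast))
print(slow)
```

A long-lived interpreter shared by many examples is harder to recover than this one-process-per-run toy, which is why a dedicated cancel method is worth implementing.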
You may want to change how to set up the interpreter or the compiler based on the examples that it will execute or on the options passed from the command line.
The options parameter holds the parsed options (plus the
options that come from the command line).
It is in the run method where the magic happens.
Its task is to execute the given source and to return the output, if any.
What to do with them is up to you.
>>> runner = ArnoldCRunner(cfg=Config(verbosity=0, encoding='utf-8'))
>>> found = runner.run(example, example.options)
>>> found
'Hello World!\n'
>>> print("PASS" if found == example.expected.str else "FAIL")
PASS
Like in the Parser, you can define the optional flavors attribute
to accept different flavors of the same language. The constructor of your
Runner will receive the language_flavors argument with the list
selected by the user so you can do something different on each case.
Concurrency model
An instance of each ExampleFinder, ExampleParser and ExampleRunner
is created once during the setup of
byexample and then once more per job thread.
By default there is only one job thread but more threads can be added
with the --jobs option.
If you want to share data among them you will have to use
thread-safe structures created by a sharer and store them
in a namespace.
This is explained in the concurrency model documentation, and byexample/modules/progress.py shows a concrete example.
Changed in byexample 10.0.0. Before 10.0.0 you were forced to use multiprocessing by hand, but in 10.0.0 the concurrency model is hidden, so you cannot rely on multiprocessing because byexample may not use processes at all! sharer and namespace are objects that hide the details while allowing you to have the same power.
ExampleFinder, ExampleParser and ExampleRunner initialization
If you decide to implement your own __init__,
you must ensure that you call the parent class’ __init__ method,
passing to it all the keyword-only arguments that you received.
Once that is done, you can use the self.cfg property to access any
configuration set in byexample including the flags/options set
(self.cfg.options).
In the __init__ you can also change the value of target (for the
ExampleFinder) and the value of language (for ExampleParser and
ExampleRunner) based on the configuration.
You can use it for example to disable the finder/parser/runner setting
target / language to None dynamically.
See Extension initialization for more about this and some troubleshooting.
New in byexample 11.0.0: self.cfg was introduced.
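The shape of such an __init__ can be sketched without byexample installed. BaseExtension below is a hypothetical stand-in for ExampleFinder, ExampleParser or ExampleRunner, cfg is a plain dictionary instead of the real configuration object, and the arnold_enabled key is made up for the example.

```python
class BaseExtension:
    """Hypothetical stand-in for byexample's extension base classes."""
    language = "arnold"

    def __init__(self, *, cfg):
        self.cfg = cfg


class ConfigurableRunner(BaseExtension):
    def __init__(self, **kargs):
        # Forward every keyword-only argument untouched to the parent.
        super().__init__(**kargs)
        # Only after that call is self.cfg available.
        if not self.cfg.get("arnold_enabled", True):
            self.language = None   # disable this extension dynamically


enabled = ConfigurableRunner(cfg={})
disabled = ConfigurableRunner(cfg={"arnold_enabled": False})
print(enabled.language)
print(disabled.language)
```

Accepting **kargs and passing it along unchanged keeps the subclass working even if the parent later receives more keyword arguments.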