Add New Languages
There are three different ways in which byexample
can be extended:
- define zones where to find examples
- support new languages: how to find them and how to run them
- perform arbitrary actions during the execution
byexample
uses the concept of modules: a python file with some extension
classes defined there. Modules can be loaded using --modules <dir>
from the command line.
What extension classes will depend of what you want to extend or customize.
In this how-to
we will see how to add a new language.
Check how to define new zones where to find examples
and how to hook to events with concerns
for a how-to
about the first and last items.
Imagine that we want to write examples in the mythical language ArnoldC
,
a programming language which its instruction set are phrases of a famous
actor.
What do we need?
How to find examples: the Finder
The first thing to teach byexample
is how to find a ArnoldC
example.
Most of the languages supported by byexample
use a prompt to mark
the begin of an example.
But just for fun, let’s imagine that we want to do something different.
Let’s say that our examples are enclosed by the ~~~
strings: anything
between two ~~~
will be considered a ArnoldC
example.
Here is what I mean
This is an example which begins here
~~~
IT'S SHOWTIME # byexample: +awesome
TALK TO THE HAND "Hello World!"
YOU HAVE BEEN TERMINATED
out:
Hello World!
~~~
The code below should produce the famous 'Hello World!' output
Notice how below the code there is a out:
tag. We will use this to
separate the code from the expected output.
Find the snippet of code
To accomplish this we need to create a regular expression to find the
~~~
, where the snippet of code is and where the expected output is.
>>> from byexample import regex as re
>>> example_re = re.compile(r'''
... # begin with ~~~
... ^[ ]* ~~~ [ ]*\n
...
... # grab everything until the 'out:' string
... # this will be our snippet of code
... (?P<snippet>
... (?:^(?P<indent> [ ]*)[^ ] .*) # first line: learn what is
... # the level of indentation
...
... (?:\n # grab everything else...
... (?![ ]*out:[ ]*\n) # except if the line starts
... # with out:
... (?![ ]*~~~) # or with ~~~
...
... .*)*) # anything else is welcome
... \n?
...
... # now, if we find 'out:', grab the expected output
... # this part of the regex is optional because not all the examples
... # output something to compare with.
... (?: [ ]* out:[ ]*\n
... (?P<expected> (?:
... (?![ ]*~~~) # except a ~~~ line,
... .+$\n? # grab everything!
... )*)
... )?
...
... # finally, the end marker
... ^[ ]* ~~~ [ ]*$
...
... ''', re.MULTILINE | re.VERBOSE)
The capture’s groups snippet
, indent
and expected
are mandatory.
The capture may be empty but those three groups must be defined.
The first should match the executable code, while the last the expected output if any that to compare.
The indent
group is to count how many spaces are not part of the example
and they are just for indentation: byexample
will drop the first line that
has a lower level of indentation and any subsequent line.
Changed in
byexample 10.0.0
. Before10.0.0
you could return a Python regular expression but from10.0.0
and on, you need to return the regular expressions created bybyexample.regex
. The module is almost identical to Python’sre
so the required changes are minimal.
Detect the language
Then, the finder needs to determinate in which language the example was written.
For our purposes let’s say that anything between ~~~
is always an
ArnoldC
example but the same finder could find examples of
different languages.
The Finder class
Now we assemble all the pieces.
We need to create a class, inherit from ExampleFinder
,
define a target
attribute and implement a few methods:
>>> from byexample.finder import ExampleFinder
>>> class ArnoldCFinder(ExampleFinder):
... target = 'ArnoldC-session'
...
... def example_regex(self):
... global example_re
... return example_re # defined above
...
... def get_language_of(self, options, match, where):
... return 'ArnoldC'
...
... def spurious_endings(self):
... return spurious_endings(self) # defined above
The target
attribute may need a little explanation. All the
Finders must declare which type of examples they are targeting.
If two finders try to find the same target, one will override the other.
This is useful if you want to use a different Finder
in replacement for
an already created one: just create a class with the same target
.
Let’s see if our finder can find the ArnoldC
snippet above.
>>> from byexample.cfg import Config
>>> finder = ArnoldCFinder(cfg=Config(verbosity=0, encoding='utf-8'))
>>> filepath = 'docs/contrib/how-to-support-new-finders-and-languages.md'
>>> where = (0,1,filepath,None)
>>> matches = finder.get_matches(open(filepath, 'r').read())
>>> matches = list(matches)
>>> len(matches)
1
>>> match = matches[0]
>>> indent = match.group('indent')
>>> len(indent)
4
>>> snippet, expected = finder.get_snippet_and_expected(match, where)
>>> print(snippet)
IT'S SHOWTIME # byexample: +awesome
TALK TO THE HAND "Hello World!"
YOU HAVE BEEN TERMINATED
>>> print(expected)
Hello World!
The get_snippet_and_expected
by default gets the snippet
and the
expected
groups from the match. But you can extend this to post-process
the strings.
Take a look of the implementation of PythonFinder
(in byexample/modules/python.py)
The PythonFinder
will find and match Python
examples that starts with
the prompt >>>
; later, it extends get_snippet_and_expected
to remove
the prompts from the snippet to return valid Python
code.
How to support new languages: the Parser and the Runner
To support new languages we need to be able to parse the code in the first place and to execute it later.
Now that we have a raw snippet from the Finder we need to polish it and
extract the options that byexample
uses to customize the example.
Get the options
The options can be of any form and be in any place.
Typically we can write the options in the comments of the code which obviously will depend on the language.
If the comments in ArnoldC
starts with a #
, we can say that every comment
that starts with byexample
is a comment that will contain options.
This regular expression should capture that:
>>> from byexample import regex as re
>>> opts_string_re = re.compile(r'#\s*byexample:\s*([^\n\'"]*)$',
... re.MULTILINE)
The unnamed group should capture the option or options; how to extract each individual option is a task for more complex parser than a simple regex.
byexample
will create a parser for us to parse all the common options (the
ones that byexample
supports by default).
It is our job to extend this parser adding more flags or arguments to parse our
own specific options (ArnoldC
’s specific).
>>> def extend_option_parser(parser):
... parser.add_flag("awesome")
See the documentation of the class OptionParser for more information.
Note: the
extend_option_parser
in theory is called once for each example. Howeverbyexample
calls it as few times as possible for performance reasons.The extraction and parsing of the options are cached: if two examples have the same options, they are parsed once and therefor the parser itself is extended just once.
If you need to tweak or disable this you could override methods of the class ExampleParser
The Parser class
Now we assemble all the pieces.
We need to create a class, inherit from ExampleParser
,
define a language
attribute and implement the missing methods:
>>> from byexample.parser import ExampleParser
>>> class ArnoldCParser(ExampleParser):
... language = 'arnold'
...
... def example_options_string_regex(self):
... global opts_string_re
... return opts_string_re
...
... def extend_option_parser(self, parser):
... return extend_option_parser(parser)
The user can select which languages should be parsed and executed and which
should not from the command line with the flag -l
.
So we need to declare what language is our Parser for: that’s the reason
behind the language
attribute.
Optionally you can add define the flavors
attribute: a set of
different flavors for your language that you have support that the user
can select with -l
(parse the example in a different way perhaps?).
Let’s create the example (in the practice this is done by byexample
behind
the scenes so you do not to be worry about the details):
>>> from byexample.options import Options, OptionParser
>>> parser = ArnoldCParser(cfg=Config(verbosity=0, encoding='utf-8', options=Options(rm=[], norm_ws=False, tags=True, capture=True, type=False, input_prefix_range=(6,12), optparser=OptionParser(add_help=False))))
>>> from byexample.finder import Example
>>> runner = None # not yet
>>> example = Example(finder, runner, parser,
... snippet, expected, indent, where)
At this point, the example created is incomplete as its source code wasn’t extracted from the snippet nor its options.
>>> example.source
<...>
AttributeError: 'Example' object has no attribute 'source'
>>> example.options
<...>
AttributeError: 'Example' object has no attribute 'options'
These attributes are completed using the parser who is the only one that knows how to extract these options from a raw example because is a language specific task.
>>> from byexample.log import init_log_system
>>> init_log_system() # needed becuase parse_yourself requires the log system
>>> example = example.parse_yourself()
>>> print(example.source)
IT'S SHOWTIME # byexample: +awesome
TALK TO THE HAND "Hello World!"
YOU HAVE BEEN TERMINATED
>>> print(example.expected.str)
Hello World!
>>> print(example.options)
{'awesome': True}
The process_snippet_and_expected
method can be extended to perform the last
minute changed to the snippet and the expected strings, after the parsing of the
options.
>>> hasattr(ExampleParser, 'process_snippet_and_expected')
True
See GDBParser
in byexample/modules/gdb.py.
The implementation extends this method to remove any comment on the snippet
because GDB
doesn’t support them.
Other useful example is PythonParser
byexample/modules/python.py
It modifies heavily the expected string to support a compatibility mode with doctest
.
TL;DR - Options from the command line and from the examples
byexample
allow the user to pass options from the command line via
-o
. These options are the same that you can set on each example but
they affect to the whole execution.
Following our +awesome
example we could call byexample
like:
$ byexample -l arnold -o=+awesome somefile.md # byexample: +skip
Parsing the options from the command line is done by
ExampleParser.extract_cmdline_options
while the parsing from each
example is done by ExampleParser.extract_options
.
In general you don’t need to worry about those two: byexample
uses the your option parser provided by extend_option_parser
in a reasonable manner.
Only if you want to do something more sophisticated you may want to
override those two methods. Just remember to call the parent method
and keep in mind that while the parsing of the options from the command
line happens in the main
thread, the rest of the parsing happens on
each worker
(this is relevant only if you need to share information between
them, in which case you need to share data safety.)
The Runner class
The Runner is who will execute the code.
Most of the times it is a proxy to a real interpreter but it can be a mix of compiler/runner depending of the underlying language.
To see how this ‘proxy’ class can interact with another program, check the
implementation of the Python and Ruby Interpreters of byexample
in
byexample/modules/python.py and
byexample/modules/ruby.py
For our case, we will implement a small toy-interpreter in Python
itself so
you do not need to install a real ArnoldC
compiler.
>>> from byexample import regex as re
>>> def toy_arnoldc_interpreter(source_code):
... output = []
... for line in source_code.split('\n'):
... if line.startswith("TALK TO THE HAND"):
... to_print = re.search(r'"([^"]*)"', line).group(1)
... output.append(to_print + '\n')
...
... return '\n'.join(output)
Now we ensemble the ExampleRunner
subclass
>>> from byexample.runner import ExampleRunner
>>> class ArnoldCRunner(ExampleRunner):
... language = 'arnold'
...
... def run(self, example, options):
... return toy_arnoldc_interpreter(example.source)
...
... def initialize(self, examples, options):
... pass
...
... def shutdown(self):
... pass
The initialize
and shutdown
methods are called before and after the
execution of all the tests. It can be used to set up the real interpreter
or to perform some off-line task (like compiling).
If possible, try to implement the cancel
method to cancel an ongoing
example and support the recovery after a timeout.
You may want to change how to setup the interpreter or the compiler based on the examples that it will execute or in the options passed from the command line.
The options
parameter are the parsed options (plus the
options that come from the command line).
It is in the run
method where the magic happen.
Its task is to execute the given source and to return the output, if any.
What to do with them is up to you.
>>> runner = ArnoldCRunner(cfg=Config(verbosity=0, encoding='utf-8'))
>>> found = runner.run(example, example.options)
>>> found
'Hello World!\n'
>>> print("PASS" if found == example.expected.str else "FAIL")
PASS
Like in the Parser
, you can define the optional flavors
attribute
to accept different flavors of the same language. The constructor of you
Runner
will receive the language_flavors
argument with the list
selected by the user so you can do something different on each case.
Concurrency model
Each ExampleFinder
, ExampleParser
and ExampleRunner
instances
will be created once during the setup of
byexample
and then it will be created once per job thread.
By default there is only one job thread but more threads can be added
with the --jobs
option.
If you want to share data among them you will have to use a
thread-safe structures created by a sharer
and store them
in a namespace
.
In the concurrency model documentation it is explained and in byexample/modules/progress.py you can see a concrete example.
Changed in
byexample 10.0.0
. Before10.0.0
you were forced to usemultiprocessing
by hand but in10.0.0
the concurrency model is hidden so you cannot relay onmultiprocessing
becausebyexample
may not use processes at all!sharer
andnamespace
are objects that hide the details while allowing you to have the same power.
ExampleFinder
, ExampleParser
and ExampleRunner
initialization
If you decide to implement your own __init__
,
you must ensure that you call parent class’ __init__
method
passing to it all the keyword-only arguments that you received.
Once done that, you can use the self.cfg
property to access any
configuration set in byexample
including the flags/options set
(self.cfg.options
).
In the __init__
you can also change the value of target
(for the
ExampleFinder
) and the value of language (for ExampleParser
and
ExampleRunner
) based on the configuration.
You can use it for example to disable the finder/parser/runner setting
target
/ language
to None
dynamically.
See Extension initialization for more about this and some troubleshooting.
New in
byexample 11.0.0
:self.cfg
was introduced.