For one of my projects I needed some one-liners parser to AST. I’ve tried PLY , pyPEG and a few more. And stopped on pyparsing . It’s actively maintained, works without magic and easy to use.
Ideally I wanted to parse something like:
LANG=en_US.utf-8 git diff | wc -l >> diffs
To something like:
(= LANG en_US.utf-8) (>> (| (git diff) (wc -l)) (diffs))
So let’s start with simple shell command, it’s just space-separated tokens:
import pyparsing as pp token = pp.Word(pp.alphanums + '_-.') command = pp.OneOrMore(token) command.parseString('git branch --help') >>> ['git', 'branch', '--help']It’s simple, another simple part is parsing environment variables. One environment variable is token=token , and list of them separated by spaces:
env = pp.Group(token + '=' + token) env.parseString('A=B') >>>[['A', '=', 'B']] env_list = pp.OneOrMore(env) env_list.parseString('VAR=test X=1') >>> [['VAR', '=', 'test'], ['X', '=', '1']]And now we can easily merge command and environment variables, mind that environment variables are optional:
command_with_env = pp.Optional(pp.Group(env_list)) + pp.Group(command) command_with_env.parseString('LOCALE=en_US.utf-8 git diff') >>> [[['LOCALE', '=', 'en_US.utf-8']], ['git', 'diff']]Now we need to add support of pipes, redirects and logical operators. Here we don’t need to know what they’re doing, so we’ll treat them just like separators between commands:
separators = ['1>>', '2>>', '>>', '1>', '2>', '>', '<', '||', '|', '&&', '&', ';'] separator = pp.oneOf(separators) command_with_separator = pp.OneOrMore(pp.Group(command) + pp.Optional(separator)) command_with_separator.parseString('git diff | wc -l >> out.txt') >>> [['git', 'diff'], '|', ['wc', '-l'], '>>', ['out.txt']]And now we can merge environment variables, commands and separators:
one_liner = pp.Optional(pp.Group(env_list)) + pp.Group(command_with_separator) one_liner.parseString('LANG=C DEBUG=true git branch | wc -l >> out.txt') >>> [[['LANG', '=', 'C'], ['DEBUG', '=', 'true']], [['git', 'branch'], '|', ['wc', '-l'], '>>', ['out.txt']]]Result is hard to process, so we need to structure it:
one_liner = pp.Optional(env_list).setResultsName('env') + \ pp.Group(command_with_separator).setResultsName('command') result = one_liner.parseString('LANG=C DEBUG=true git branch | wc -l >> out.txt') print('env:', result.env, '\ncommand:', result.command) >>> env: [['LANG', '=', 'C'], ['DEBUG', '=', 'true']] >>> command: [['git', 'branch'], '|', ['wc', '-l'], '>>', ['out.txt']]Although we didn’t get AST, but just a bunch of grouped tokens. So now we need to transform it to proper AST:
def prepare_command(command): """We don't need to work with pyparsing internal data structures, so we just convert them to list. """ for part in command: if isinstance(part, str): yield part else: yield list(part) def separator_position(command): """Find last separator position.""" for n, part in enumerate(command[::-1]): if part in separators: return len(command) - n - 1 def command_to_ast(command): """Recursively transform command to AST.""" n = separator_position(command) if n is None: return tuple(command[0]) else: return (command[n], command_to_ast(command[:n]), command_to_ast(command[n + 1:])) def to_ast(parsed): if parsed.env: for env in parsed.env: yield ('=', env[0], env[2]) command = list(prepare_command(parsed.command)) yield command_to_ast(command) list(to_ast(result)) >>> [('=', 'LANG', 'C'), >>> ('=', 'DEBUG', 'true'), >>> ('>>', ('|', ('git', 'branch'), >>> ('wc', '-l')), >>> ('out.txt',))]It’s working. The last part, glue that make it easier to use:
def parse(command): result = one_liner.parseString(command) ast = to_ast(result) return list(ast) parse('LANG=en_US.utf-8 git diff | wc -l >> diffs') >>> [('=', 'LANG', 'en_US.utf-8'), ('>>', ('|', ('git', 'diff'), ('wc', '-l')), ('diffs',))]Although it can’t parse all one-liners, it doesn’t support nested commands like:
echo $(git branch) echo `git branch`
But it’s enough for my task and support of not implemented features can be added easily.
Gist with source code.