Pipe Operator in Python
Let’s do something completely silly today. Let’s add a pipe operator |>
to Python that works the same as in the Gleam programming language. From the Gleam docs:
import gleam/io
import gleam/string

pub fn main() {
  // Without the pipe operator
  io.println(string.drop_start(string.drop_end("Hello, Joe!", 1), 7))

  // With the pipe operator
  "Hello, Mike!"
  |> string.drop_end(1)
  |> string.drop_start(7)
  |> io.println
}
Basically what we have here is an operator that allows us to chain function calls. I think this makes code much more readable by elucidating the different steps being used to transform data. Here’s how I imagine this looking in Python (I’m going to use a different, more illustrative example here, since slicing in Python already makes it easy to “drop” characters from a string):
def add_one(l):
    return (i + 1 for i in l)

print(sum(add_one([1, 2, 3])))

# new pipe syntax:
[1, 2, 3] |> add_one |> sum |> print
# 9
Again, this example really illustrates how the nested function calls obfuscate what we’re trying to do – you have to find your way to the center of all those parentheses to see the original function argument and then back your way out to the print
call. The pipe operator on the other hand makes it immediately clear what we’re up to – we start with some data and then apply various transformations to it.
Baby’s first introduction to Python’s grammar
We know how we want our syntax to look, but how do we actually go about implementing it? When I started working on this post I thought it would require me to write a lot of C code, but it turns out that I only touched a few lines of C! I mostly tweaked some metadata files in the CPython repository.
Before implementing the full pipe syntax, I thought it would be easier to introduce a new operator |>
that does the same thing as the +
operator. In other words, I wanted to be able to write print(5 |> 6)
and have my custom Python interpreter spit out “11”.
First things first, let’s clone the Python repo, check out the version I’m interested in mucking around with, and configure everything. I’m basically following the directions here.
After forking CPython on Github:
> git clone git@github.com:dean-shaff/cpython.git
> cd cpython
> git checkout v3.13.5
> ./configure --with-pydebug && make -j8
(Shout out to the Python devs: this just works straight out of the box with no messing around!)
Now I should have a lil Python executable in my cpython
directory:
> ./python.exe -c "print(5 + 6)"
11
> ./python.exe -c "print(5 |> 6)"
File "<string>", line 1
5 |> 6
^
SyntaxError: invalid syntax
Obviously the pipe operator doesn’t work because we haven’t implemented anything yet! What files do we need to modify to introduce this new operator?
- Grammar/Tokens
- Grammar/python.gram
(I’m serious, that’s it!)
In Grammar/Tokens
we need to add our new special token:
VBAR '|'
PIPE '|>'
AMPER '&'
Now, in Grammar/python.gram
we need to add our new operator to the Python grammar:
# Arithmetic operators
# --------------------

sum[expr_ty]:
    | a=sum '+' b=term { _PyAST_BinOp(a, Add, b, EXTRA) }
    | a=sum '-' b=term { _PyAST_BinOp(a, Sub, b, EXTRA) }
    | a=sum '|>' b=term { _PyAST_BinOp(a, Add, b, EXTRA) }
    | term
(I’ll get more into this later)
Now, we have to run some conveniently defined make
commands to regenerate some C-code:
> make regen-token
> make regen-pegen
> make -j8
Now we can run our slick new custom operator:
> ./python.exe -c "print(5 |> 6)"
11
Very cool, but what’s actually happening here? First, we add our new pipe operator to the list of tokens that Python recognizes. From what I can tell, a token is a character or sequence of characters that is recognized as having special meaning to the Python interpreter. Tokens are produced during lexical analysis, which is part of the process of reading .py files and turning them into an abstract syntax tree that can in turn be evaluated. To be more specific, “lexing” is the process of ingesting a stream of text data (a .py file) and turning it into a series of tokens that have some attached meaning. For example, the +
operator does not serve the same grammatical purpose as the def
keyword; lexing not only identifies the presence of +
and def
, but also distinguishes between +
as an operator and def
as a keyword. “Parsing” is the process of taking that stream of tokens and converting it into an abstract syntax tree (AST). The structure of this AST reflects that of your program. Today we need to change how Python does both lexing and parsing; we need Python to identify |>
as an operator when scanning through Python source code, and we need it to introduce an addition operation into the AST when it sees that operator.
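By the way, we can poke at both of these stages from stock Python using the standard library: the tokenize module shows the token stream that lexing produces, and the ast module shows the tree that parsing builds from it. This is just an illustration, not part of the CPython changes:

# Stock Python is fine here; this just illustrates lexing vs. parsing.
import ast
import io
import tokenize

source = "print(5 + 6)"

# Lexing: the raw text becomes a stream of typed tokens (NAME, OP, NUMBER, ...).
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))

# Parsing: the token stream becomes an abstract syntax tree.
print(ast.dump(ast.parse(source), indent=2))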
We can see the files that the make regen-token
and make regen-pegen
commands modify:
Doc/library/token-list.inc
Include/internal/pycore_token.h
Lib/token.py
Parser/parser.c
Parser/token.c
I’m going to ignore Doc/library/token-list.inc
and Lib/token.py
; those aren’t super interesting. Taking a look at Parser/token.c
, we see some changes in the _PyToken_TwoChars
function:
int
_PyToken_TwoChars(int c1, int c2)
{
    switch (c1) {
    case '!':
        switch (c2) {
        case '=': return NOTEQUAL;
        }
        break;
    case '%':
        switch (c2) {
        case '=': return PERCENTEQUAL;
        }
        break;
    case '&':
        switch (c2) {
        case '=': return AMPEREQUAL;
        }
        break;
    case '*':
        switch (c2) {
        case '*': return DOUBLESTAR;
        case '=': return STAREQUAL;
        }
        break;
    case '+':
        switch (c2) {
        case '=': return PLUSEQUAL;
        }
        break;
    case '-':
        switch (c2) {
        case '=': return MINEQUAL;
        case '>': return RARROW;
        }
        break;
    case '/':
        switch (c2) {
        case '/': return DOUBLESLASH;
        case '=': return SLASHEQUAL;
        }
        break;
    case ':':
        switch (c2) {
        case '=': return COLONEQUAL;
        }
        break;
    case '<':
        switch (c2) {
        case '<': return LEFTSHIFT;
        case '=': return LESSEQUAL;
        case '>': return NOTEQUAL;
        }
        break;
    case '=':
        switch (c2) {
        case '=': return EQEQUAL;
        }
        break;
    case '>':
        switch (c2) {
        case '=': return GREATEREQUAL;
        case '>': return RIGHTSHIFT;
        }
        break;
    case '@':
        switch (c2) {
        case '=': return ATEQUAL;
        }
        break;
    case '^':
        switch (c2) {
        case '=': return CIRCUMFLEXEQUAL;
        }
        break;
    case '|':
        switch (c2) {
        case '=': return VBAREQUAL;
        case '>': return PIPE;
        }
        break;
    }
    return OP;
}
We can see in that last case
block that our PIPE
token is being returned when we see the presence of |
followed by >
! That’s pretty darn cool!
In Include/internal/pycore_token.h
we see that we’ve used the preprocessor to define the PIPE
constant.
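Even though I said Lib/token.py isn’t super interesting, the regenerated version does give us a quick way to sanity-check that the new token was registered. This is just a sketch: it has to be run with the freshly built ./python.exe, since a stock Python has no PIPE token.

# Run with ./python.exe after `make regen-token`; stock Python has no PIPE token.
import token

print(token.VBAR, token.tok_name[token.VBAR])  # the existing '|' token
print(token.PIPE, token.tok_name[token.PIPE])  # our new '|>' token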
Things get a little more complicated in parser.c
; here’s where the line that we added to Grammar/python.gram
actually gets translated to C code. In Grammar/python.gram
we basically copied and pasted the line defining what happens with the addition operator and repurposed it for our new |>
operator. To be honest, I don’t understand this syntax entirely, but I think we can get a pretty good idea what’s going on:
| a=sum '|>' b=term { _PyAST_BinOp(a, Add, b, EXTRA) }
_PyAST_BinOp(a, Add, b, EXTRA) constructs the AST node for an addition, with a (the left-hand side) and b (the right-hand side) as its operands. We can see this reflected in parser.c. Some thirteen thousand lines in, we see this new block of code:
    { // sum '|>' term
        if (p->error_indicator) {
            p->level--;
            return NULL;
        }
        D(fprintf(stderr, "%*c> sum[%d-%d]: %s\n", p->level, ' ', _mark, p->mark, "sum '|>' term"));
        Token * _literal;
        expr_ty a;
        expr_ty b;
        if (
            (a = sum_rule(p))  // sum
            &&
            (_literal = _PyPegen_expect_token(p, 19))  // token='|>'
            &&
            (b = term_rule(p))  // term
        )
        {
            D(fprintf(stderr, "%*c+ sum[%d-%d]: %s succeeded!\n", p->level, ' ', _mark, p->mark, "sum '|>' term"));
            Token *_token = _PyPegen_get_last_nonnwhitespace_token(p);
            if (_token == NULL) {
                p->level--;
                return NULL;
            }
            int _end_lineno = _token->end_lineno;
            UNUSED(_end_lineno); // Only used by EXTRA macro
            int _end_col_offset = _token->end_col_offset;
            UNUSED(_end_col_offset); // Only used by EXTRA macro
            _res = _PyAST_BinOp ( a , Add , b , EXTRA );
            if (_res == NULL && PyErr_Occurred()) {
                p->error_indicator = 1;
                p->level--;
                return NULL;
            }
            goto done;
        }
        p->mark = _mark;
        D(fprintf(stderr, "%*c%s sum[%d-%d]: %s failed!\n", p->level, ' ',
            p->error_indicator ? "ERROR!" : "-", _mark, p->mark, "sum '|>' term"));
    }
I’m not going to pretend to understand exactly what’s happening here, but we do see the call to _PyAST_BinOp
, and we see that we’re expecting the |>
token.
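As a quick sanity check, we can ask the freshly built interpreter what tree it builds for the new syntax. Since the grammar action is still _PyAST_BinOp(a, Add, b, EXTRA), the dump for 5 |> 6 should come out identical to the one for 5 + 6:

# Run with ./python.exe; a stock Python will raise a SyntaxError on '|>'.
import ast

# Both expressions should parse to a BinOp node with an Add operator.
print(ast.dump(ast.parse("5 + 6"), indent=2))
print(ast.dump(ast.parse("5 |> 6"), indent=2))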
Gimme dat pipe
Now that we’ve introduced the |> operator, let’s make it do what we actually want. Here, we can go into python.gram again and replace the line | a=sum '|>' b=term { _PyAST_BinOp(a, Add, b, EXTRA) } with something that will actually call the right-hand side of the expression with the left-hand side as an argument. Digging around in that same file, we can see that there is a _PyAST_Call function. After some experimentation, I ended up using the following line:
| a=sum '|>' b=term { _PyAST_Call(b, CHECK(asdl_expr_seq*, _PyPegen_singleton_seq(p, a)), NULL, EXTRA) }
To be 100% honest, I don’t understand exactly what’s going on here. It seems that the first argument to the _PyAST_Call function is the callable that we want to call, and the second is the sequence of arguments that we want to call it with, but I don’t really know what all the business is about CHECK(asdl_expr_seq*, _PyPegen_singleton_seq(p, a)). If someone can help me out with this I’d really appreciate it!
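For a bit of intuition about the node we’re building, we can look at the same structure from the Python side with the standard ast module (stock Python is fine here): a Call node has a func field for the callable and an args field holding the positional arguments. Going by its name, _PyPegen_singleton_seq presumably wraps a into a one-element sequence so that it can serve as that args field.

# Stock Python; just to see the shape of a Call node in the AST.
import ast

# print(5) parses to Call(func=Name('print'), args=[Constant(5)], keywords=[]).
# Our grammar action builds the same kind of node, with b as func and a as
# the single positional argument.
print(ast.dump(ast.parse("print(5)"), indent=2))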
When we run those same make commands again, we end up with a Python interpreter that evaluates our new pipe operator in exactly the way that we want:
> make regen-pegen
> make -j8
> ./python.exe -c "5 |> (lambda a: a + 5) |> print"
10
Note that we have to wrap our lambda in parentheses, otherwise the expression won’t parse correctly.
Let’s write a script that shows off some of the cool things we can do with this pipe operator:
def add_five(x):
    return x + 5

def double(x):
    return x * 2

def add_one(x):
    return x + 1

5 |> add_one |> double |> print
3 |> add_five |> (lambda x: x * 3) |> print

def reverse(s):
    return s[::-1]

"hello" |> str.upper |> reverse |> print

def double_list(lst):
    return (x * 2 for x in lst)

def sum_list(lst):
    return sum(lst)

[1, 2, 3] |> double_list |> sum_list |> print
When we run this we see the following output:
> ./python.exe example.py
12
24
OLLEH
12
Partial application
This is all pretty cool, but what if we want to use our pipe operator with functions that take more than one argument? Take, for example, the double_list
function from the previous example. What if we wanted to be able to multiply every element of our list by an arbitrary number? Right now we’d have to write a closure or use functools.partial
:
from collections.abc import Iterable
from functools import partial

# using closure
def multiply(factor: int):
    def inner(lst: Iterable[int]) -> Iterable[int]:
        return (x * factor for x in lst)
    return inner

[1, 2, 3] |> multiply(10) |> print

# using partial
def multiply(factor: int, lst: Iterable[int]) -> Iterable[int]:
    return (x * factor for x in lst)

[1, 2, 3] |> partial(multiply, 10) |> print
This works, but it feels a little verbose! What if we made Python functions behave like Gleam functions, in the sense that functions are always partially applied1? Let’s write a function decorator that will allow for the following behaviour:
# example2.py
from collections.abc import Iterable

from partialize import partialize

@partialize
def multiply(factor: int, lst: Iterable[int]) -> Iterable[int]:
    return (x * factor for x in lst)

[1, 2, 3] |> multiply(10) |> list |> print
To do this, we need to get some metadata about the function that is being passed to the partialize
decorator. Luckily, Python has the inspect
module that allows us to get information about “live” Python objects.
from collections.abc import Callable
from dataclasses import dataclass
import functools
import inspect


@dataclass
class Partial:
    fn: Callable
    n_parameters: int

    def __call__(self, *args, **kwargs):
        print(f"__call__: {args=}, {len(args)=}")
        if len(args) == self.n_parameters:
            return self.fn(*args, **kwargs)
        return self.init(functools.partial(self.fn, *args, **kwargs))

    @classmethod
    def init(cls, fn: Callable):
        sig = inspect.signature(fn)
        n_parameters = sum(
            1
            for p in sig.parameters.values()
            if p.kind
            in (
                inspect.Parameter.POSITIONAL_ONLY,
                inspect.Parameter.POSITIONAL_OR_KEYWORD,
            )
        )
        return cls(fn, n_parameters)


def partialize(fn):
    return Partial.init(fn)
Now, I’m sure that there are smarter ways to go about this, but this implementation is relatively clear. We create a decorator partialize that creates an instance of Partial. When we call the __call__ method on a Partial instance, it does one of two things. If the number of arguments passed is equal to the number of arguments that the inner function expects, it evaluates the function. If the number of arguments is less than the number expected, it creates a new Partial object whose inner function is a functools.partial object.
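To make that concrete, here’s a tiny sketch of Partial at work on its own, without the pipe operator (note that the debug print in __call__ will also fire on each call):

# Assumes the Partial / partialize definitions above are in scope.
@partialize
def multiply(factor, lst):
    return [x * factor for x in lst]

double = multiply(2)        # only one of two positional args -> a new Partial
print(double([1, 2, 3]))    # all args present now -> [2, 4, 6]
print(multiply(3, [1, 2]))  # still callable with everything at once -> [3, 6]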
The key difference between this implementation and Gleam’s partial application is that my version is left applied, and Gleam’s is right applied. In other words, when you use partialize
, you expect to pass arguments left to right, creating new closures along the way. This is why the signature of multiply
is multiply(factor: int, lst: Iterable[int]) -> Iterable[int], not multiply(lst: Iterable[int], factor: int) -> Iterable[int]
. In Gleam, it works the other way around; function application happens from right to left. I think the Gleam approach is slightly more intuitive, but also less Pythonic.
If we run the example from above we see that things work as expected:
> ./python.exe example2.py
[10, 20, 30]
Wrapping up
In this post I dipped my toes into playing around with the CPython lexer and parser, adding my own pipe operator that behaves like Gleam’s. Going into this post, I thought I would have to write a lot of C code, but it turns out that I just had to adjust a few configuration files, and the development tools generate the underlying C code on their own. I don’t see this pipe operator making its way into the Python language upstream, but it is fun to experiment with adding custom functionality to the language I spend most of my time writing and reading. It also makes me wonder what my ideal programming language would look like. Perhaps one day I’ll release a custom superset of Python that incorporates a bunch of cool, expressive features like this pipe operator.
- What do I mean by “partial application” when it comes to functions? Remember the first Gleam example, where we chain together calls to string.drop_end and string.drop_start? If we look at the docs for both functions, they take two arguments. When we call string.drop_end with a single argument, it acts as a closure, returning a new function that takes a single string argument. When we call this, we get a new string with the last n characters lopped off. ↩