---
title: "Pangler: literate programming in Pandoc"
author: Federico Igne
date: \today
...
This document describes the logic and design of `pangler`, a minimal tangler for literate programming using the Pandoc Markdown syntax.
Literate Programming (LP) is a programming paradigm that emphasizes the natural flow of thoughts that the programmer experiences when writing software. The paradigm can be seen as "documentation first" and the focus is on "human-to-human" communication. The produced document is a text-based prose document describing the logic and the design of the program, interspersed with snippets of code that form the final software.
Given an LP document, one can either extract the tangled code (with a "tangler") or generate its documentation, "woven" from the literate input source (with a "weaver").
In this case, Pandoc is a very good weaver, supporting the generation of different document formats from a Markdown source.
This document is an attempt at providing a tangler working alongside Pandoc.
`pangler` is itself written in Pandoc Markdown format and can be generated from this document using itself.
Literate programming with pangler
`pangler` uses two main features provided by the Pandoc Markdown syntax, which are not necessarily present in other Markdown flavours:

- `backtick_code_blocks` for writing fenced snippets of code, and
- `fenced_code_attributes` for adding arbitrary HTML attributes, classes and IDs to a snippet of code.
Writing programs
In the following, we indicate a literate program as a markdown file written in Pandoc Markdown syntax.
A minimal block of code recognized by `pangler`, with ID `identifier`, is
```{#identifier}
[code snippet]
```
Code blocks can contain code macros of the form `<<identifier>>`, where `identifier` is a valid code block ID.
Code macros will be recursively substituted by the corresponding code snippet during code generation.
A code macro needs to be placed on its own line, with an optional (whitespace) indentation that is used during code generation to indent the substituted code snippet.
Additional attributes and classes can be added to a code block, as well; the language of the code snippet can be provided and is useful to enable correct syntax highlighting.
```{#identifier .python}
[python code snippet]
```
Identifiers
An identifier can either be a string, representing a macro to be used inside other code blocks, or a filename. In the latter case, the code block will be considered a valid entry point for code generation.
#[derive(Eq, Hash, PartialEq)]
enum Key {
    Macro(String),
    Entry(PathBuf)
}
impl Key {
    fn get_path(&self) -> Option<&PathBuf> {
        match self {
            Self::Entry(s) => Some(&s),
            Self::Macro(_) => None
        }
    }
}
There are currently 3 (possibly overlapping) scenarios in which an identifier is considered a valid filename.
First, the filename matches the following regex
static ref PATH: Regex =
Regex::new(
r"^(?:[[:word:]\.-]+/)*[[:word:]\.-]+\.[[:alpha:]]+$"
).unwrap();
For example
```{#file.py .python}
[python main file]
```
Second, the code block contains a `path` attribute.
With this feature, we can generate a file into a more complex folder structure.
For example, the following code block determines the content of file `path/to/file.py`.
```{#file.py .python path="path/to/"}
[python main file]
```
This path is relative to the current working directory, unless the `-o/--output` flag is used.
Third, the code block has the `entry` class.
This is useful when declaring an entry point that doesn't match any of the previous cases.
```{#Dockerfile .dockerfile .entry}
[Docker directives]
```
Any ID that doesn't match any of the previous cases is considered an internal macro. Code blocks without an ID are ignored.
if !id.is_empty() {
    let key = {
        <<regex_path_lazy>>
        let entry = clss.contains(&String::from("entry"));
        let path = attrs
            .into_iter()
            .find_map(|(k,p)|
                if k == "path" { Some(p.clone()) } else { None });
        if entry || path.is_some() || PATH.is_match(id) {
            let path =
                PathBuf::from(path.unwrap_or_default()).join(id);
            if path.starts_with(&target) {
                Some(Key::Entry(path))
            } else {
                None
            }
        } else {
            Some(Key::Macro(id.to_string()))
        }
    };
    if let Some(key) = key {
        <<code_block>>
    }
} else {
    eprintln!("Ignoring code block without ID:");
    eprintln!("{}", indent(Cow::from(code),4));
}
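
The three scenarios above can be tried out in isolation with a standard-library-only sketch. The names `Kind`, `is_component`, `looks_like_path` and `classify` are hypothetical and not part of `pangler`; the filename regex is approximated by a hand-written check, and the target prefix filter is left out.

```rust
use std::path::PathBuf;

/// Hypothetical stand-in for the `Key` type used by the program.
#[derive(Debug, PartialEq)]
enum Kind {
    Macro(String),
    Entry(PathBuf),
}

/// A path component: one or more ASCII word characters, dots or dashes.
fn is_component(s: &str) -> bool {
    !s.is_empty()
        && s.chars()
            .all(|c| c.is_ascii_alphanumeric() || matches!(c, '_' | '.' | '-'))
}

/// Hand-written approximation of the filename regex: optional directories
/// followed by a file name ending in a dot and an alphabetic extension.
fn looks_like_path(id: &str) -> bool {
    let mut components: Vec<&str> = id.split('/').collect();
    let last = components.pop().unwrap_or("");
    let dirs_ok = components.iter().all(|c| is_component(c));
    match last.rsplit_once('.') {
        Some((stem, ext)) => {
            dirs_ok
                && is_component(stem)
                && !ext.is_empty()
                && ext.chars().all(|c| c.is_ascii_alphabetic())
        }
        None => false,
    }
}

/// Applies the three rules: `entry` class, `path` attribute, filename shape.
fn classify(id: &str, classes: &[&str], path_attr: Option<&str>) -> Kind {
    if classes.contains(&"entry") || path_attr.is_some() || looks_like_path(id) {
        Kind::Entry(PathBuf::from(path_attr.unwrap_or_default()).join(id))
    } else {
        Kind::Macro(id.to_string())
    }
}
```

With this sketch, `classify("file.py", &["python"], Some("path/to/"))` yields an entry for `path/to/file.py`, while `classify("identifier", &["python"], None)` yields a macro.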
Redefining code blocks
Code blocks are processed in order. By default, if an identifier is already defined, the code block is appended to the current corresponding value.
Use the `override` class in the code block definition to cause the block to override the previous entry with the same key, if one exists.
```{#identifier .python .override}
[Python code snippet]
```
This is handled in code as follows
if clss.iter().any(|c| c == "override") {
    blocks.insert(key, Cow::from(code));
} else {
    blocks.entry(key)
        .and_modify(|s| {
            *s += "\n";
            *s += Cow::from(code)
        })
        .or_insert(Cow::from(code));
}
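
The append-or-override behaviour can be reproduced with plain `String`s, which may help when experimenting outside the Pandoc filter. This is a simplified sketch: `add_block` is a hypothetical helper, and it uses `String` keys and values instead of `Key` and `Cow`.

```rust
use std::collections::HashMap;

/// Registers a code block: appends to an existing entry by default,
/// or replaces it when `override_existing` is set.
fn add_block(
    blocks: &mut HashMap<String, String>,
    id: &str,
    code: &str,
    override_existing: bool,
) {
    if override_existing {
        blocks.insert(id.to_string(), code.to_string());
    } else {
        blocks
            .entry(id.to_string())
            .and_modify(|s| {
                // Blocks with the same ID are joined by a newline.
                s.push('\n');
                s.push_str(code);
            })
            .or_insert_with(|| code.to_string());
    }
}
```

Adding `"x"` and then `"y"` under the same ID leaves `"x\ny"`; adding `"z"` with the override flag set leaves just `"z"`.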
Tangling: generating the source files
To bootstrap the tangling process, a tangled version of the program is provided alongside the literate version.
The executable can be compiled from the root of the project with
cargo build --release
From now on you can make changes to the `README.md` file and use your latest compiled version of `pangler` to tangle and compile it.
Using Docker
An alternative way to compile the project is using Docker. Run the following from the root of the project without the need to install Rust locally.
docker run \
--rm \
--user "$(id -u)":"$(id -g)" \
--volume "$PWD":/usr/src/pangler \
--workdir /usr/src/pangler \
rust:latest \
cargo build --release
See the official documentation for more information.
The executable will be in `target/release/pangler`.
Weaving: generating the documentation
As explained above, we use `pandoc` as a weaver.
Run the following command to generate a PDF file for this document
pandoc --to latex \
--listings \
--number-sections \
--lua-filter=util/weaver.lua \
--output pangler.pdf \
README.md
The Lua filter `util/weaver.lua` is provided to handle custom `pangler` attributes during the PDF generation via the \LaTeX\ engine.
Integration with (Neo)Vim
(Neo)Vim supports code highlighting inside Markdown blocks, when the programming language is provided among its attributes. Add the following to your config file to enable code highlighting for a specific set of languages
let g:markdown_fenced_languages =
['python','rust','scala']
Command Line Interface
`pangler` offers a very simple command line interface.
For an overview of the functionalities offered by the tool run
pangler --help
`pangler` uses the `clap` library to parse command line arguments
clap = { version = "3.1", features = ["derive"] }
use clap::Parser;
using the Derive API to define the exposed functionalities.
The `struct` holding the CLI information is defined as follows
/// A tangler for Literate Programming in Pandoc
#[derive(Parser, Debug)]
#[clap(author, version, about, long_about = None)]
struct Config {
<<config_list>>
<<config_depth>>
<<config_output>>
<<config_target>>
<<config_input>>
}
and the arguments are parsed as
let config = Config::parse();
Input files
`pangler` accepts a sequence of files. These are parsed in order, and the code they contain is collected and used to build the final program.
Note that the order of the files provided on the CLI is important when using the overriding functionality.
/// Input files
input: Vec<PathBuf>,
Specifying target entry points
By default, `pangler` will generate all entry points gathered from the input file(s).
This behaviour can be overridden with the `-t/--target` flag.
/// Limit entry points to those matching the given prefix
#[clap(short, long, value_name="PREFIX")]
target: Option<PathBuf>,
let target = config.target.unwrap_or_default();
Any entry point that does not have the provided target as a prefix will be ignored.
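
The prefix test relies on `Path::starts_with`, which compares whole path components rather than characters. A small sketch, with a hypothetical `select_targets` helper that is not part of `pangler`, shows the consequences:

```rust
use std::path::{Path, PathBuf};

/// Keeps only the entry points that have `target` as a component-wise prefix.
fn select_targets(entries: &[PathBuf], target: &Path) -> Vec<PathBuf> {
    entries
        .iter()
        .filter(|p| p.starts_with(target))
        .cloned()
        .collect()
}
```

Under this rule the target `src` selects `src/main.rs` and `src/util/io.rs` but not `srcs/x.rs`, and an empty target (the default when the flag is omitted) selects every entry point.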
Listing entry points
Sometimes it might be useful to simply get a list of all entry points considered by `pangler` for a specific input.
Using the `-l/--list` flag, `pangler` will simply list all valid entry points to stdout and exit.
Note that the output will be consistent with any provided target.
/// Simply list entry points and exit
#[clap(short, long)]
list: bool,
Custom output folder
By default, files are generated in the current working directory.
const BASE: &str = "./";
This behaviour can be overridden using the `-o/--output` flag.
If the output folder does not exist, it will be created.
/// Base output directory [default: './']
#[clap(short, long, value_name="PATH")]
output: Option<PathBuf>,
Limiting recursion depth
Finally, recursive substitution of blocks can lead to an infinite loop.
By default, `pangler` will stop after 10 substitution iterations, but this parameter can be changed with the `-d/--depth` flag.
/// Maximum substitution depth
#[clap(short, long, default_value_t=10, value_name="N")]
depth: u32,
The program
The program is structured as a single Rust file with the following being the main entry point of the program
<<uses>>
<<constants>>
<<config>>
<<types>>
<<functions>>
fn main() -> Result<()> {
<<config_parse>>
<<pandoc_setup>>
Ok(())
}
Pandoc
We are using `rust-pandoc` and `pandoc-ast` to interact with `pandoc` from Rust.
pandoc = "0.8"
pandoc_ast = "0.8"
use pandoc::{
InputFormat,InputKind,OutputFormat,OutputKind,Pandoc
};
use pandoc_ast::Block;
First we need to initialize a new `Pandoc` `struct`
let mut pandoc = Pandoc::new();
and set up the input parameters. The input is a sequence of Markdown files passed as config options from the CLI.
pandoc.set_input(InputKind::Files(config.input));
pandoc.set_input_format(InputFormat::Markdown, vec![]);
The output is piped to stdout in JSON format.
pandoc.set_output(OutputKind::Pipe);
pandoc.set_output_format(OutputFormat::Json, vec![]);
In this way, we will be able to pipe the output into a Pandoc filter that will collect the code snippets and build the codebase for us.
pandoc.add_filter(
move |json| pandoc_ast::filter(json,
|pandoc| {
<<pandoc_filter>>
}
)
);
pandoc.execute().unwrap();
Pandoc filters
Pandoc allows for the definition of custom filters to change the abstract syntax tree of a document.
In this case we use a filter to collect code snippets from the input Markdown text into a `HashMap`, mapping code block identifiers to code block snippets.
use std::borrow::Cow;
use std::collections::HashMap;
type Blocks<'a> = HashMap<Key,Cow<'a,str>>;
Code blocks are wrapped into a `Cow`, i.e., a "copy-on-write" smart pointer, to avoid string duplication unless strictly necessary.
We iterate over all code blocks, along with their IDs, classes and attributes, collecting them
let mut blocks: Blocks = HashMap::new();
pandoc.blocks.iter().for_each(|block|
if let Block::CodeBlock((id,clss,attrs), code) = block {
<<code_block_gathering>>
}
);
And then we build the source code, making sure to cut off recursive code generation with depth larger than `config.depth`.
If the `-l/--list` flag is provided, `pangler` will simply list the available entry points and exit.
if config.list {
blocks.keys().for_each(|k| match k {
Key::Entry(s) => println!("{}", s.display()),
Key::Macro(_) => {}
});
} else {
build(&config.output, &blocks, config.depth);
}
The filter returns the Pandoc JSON unchanged.
pandoc
Source code generation
In order to build the source code from the gathered code block snippets, we need to recursively substitute code macros of the form `<<identifier>>` with the corresponding code block.
Code macros are matched with the following regex
static ref MACRO: Regex =
Regex::new(
r"(?m)^([[:blank:]]*)<<([^>\s]+)>>"
).unwrap();
Note that, when matching the code macro, we keep track of its indentation as well, in order to properly indent the substituted code.
Given a code macro, the following closure will compute the substituting block of code, properly indented.
The input `Captures` structure holds the regex capture groups, i.e., indentation and macro identifier, along with the full match in the first position.
In case we reach the maximum allowed depth we truncate code block substitution and notify the user that something might not have been generated as expected.
|caps: &Captures| {
    if current_depth < max_depth {
        let block = blocks
            .get(&Key::Macro(caps[2].to_string()))
            .unwrap_or_else(|| panic!(
                "Block \"{}\" not present",
                caps[2].to_string()))
            .clone();
        indent(block, caps[1].len())
    } else {
        eprintln!("Reached maximum depth, \
                   output might be truncated.\n\
                   Increase `--depth` accordingly.");
        Cow::Owned(String::from(""))
    }
}
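
To make the substitution and the depth cut-off concrete, here is a regex-free sketch of the whole expansion loop. It is an approximation under stated assumptions: `parse_macro` and `expand` are hypothetical names, blocks are stored as plain `String`s, and a macro must occupy its whole line (the regex above only anchors the start of the line).

```rust
use std::collections::HashMap;

/// Splits a macro line into (indentation, identifier), if it is one.
fn parse_macro(line: &str) -> Option<(&str, &str)> {
    let body = line.trim_start_matches(|c: char| c == ' ' || c == '\t');
    let indent = &line[..line.len() - body.len()];
    let id = body.trim_end().strip_prefix("<<")?.strip_suffix(">>")?;
    let valid = !id.is_empty() && !id.contains(|c: char| c == '>' || c.is_whitespace());
    if valid { Some((indent, id)) } else { None }
}

/// Substitutes macros pass by pass; once `max_depth` passes have run,
/// remaining macros are replaced by empty lines instead of being expanded.
fn expand(code: &str, blocks: &HashMap<String, String>, max_depth: u32) -> String {
    let mut code = code.to_string();
    let mut depth = 0;
    while code.lines().any(|l| parse_macro(l).is_some()) {
        let mut out: Vec<String> = Vec::new();
        for line in code.lines() {
            match parse_macro(line) {
                Some((indent, id)) if depth < max_depth => {
                    let block = blocks
                        .get(id)
                        .unwrap_or_else(|| panic!("Block \"{}\" not present", id));
                    for l in block.lines() {
                        if l.is_empty() {
                            // Empty lines are not indented.
                            out.push(String::new());
                        } else {
                            out.push(format!("{}{}", indent, l));
                        }
                    }
                }
                // Depth limit reached: truncate the substitution.
                Some(_) => out.push(String::new()),
                None => out.push(line.to_string()),
            }
        }
        code = out.join("\n");
        depth += 1;
    }
    code
}
```

A self-referential block such as `x` followed by `<<loop>>` would expand forever without the limit; with `max_depth` set to 2 the sketch stops after two substitutions and drops the leftover macro.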
As explained above, the building process iterates over all collected blocks and detects relevant entry points (files to generate) to start the recursive macro substitution.
fn build(
    base: &Option<PathBuf>,
    blocks: &Blocks,
    max_depth: u32
) {
    <<regex_macro_lazy>>
    blocks
        .iter()
        .filter_map(|(key,code)| {
            key.get_path().map(|k| (k,code))
        })
        .for_each(|(path,code)| {
            <<code_generation>>
        })
}
Recursive macro substitution
The code generating algorithm went through multiple iterations and showed some interesting details of using `Cow`s.
let mut current_depth = 0;
let mut code = code.clone();
while MACRO.is_match(&code) {
code = MACRO.replace_all(
&code,
<<macro_closure>>
);
current_depth += 1;
}
The problem with this version is that, due to how `Cow` works, the value returned by `replace_all` cannot outlive the borrowed `code` passed as a parameter.
This is because the function returns a reference to `code` (`Cow::Borrowed`) if no replacement takes place, so for the returned value to be valid, `code` still needs to be available.
But here `code` gets overwritten right away, so, in principle, if no replacement takes place `code` gets overwritten by a reference to itself (losing data).
This does not happen in practice (although the compiler cannot know it) because `replace_all` is only called while some replacement is possible (the `while` condition).
In other words, every call to `replace_all` returns a `Cow::Owned` value.
The problem is solved by a clever use of pattern matching
let mut current_depth = 0;
let mut code = code.clone();
while let Cow::Owned(new_code) = MACRO.replace_all(
&code,
<<macro_closure>>
) {
code = Cow::from(new_code);
current_depth += 1;
}
In this case, the `String` extracted from the matched `Cow::Owned` is not tied to the lifetime of the borrowed value `code` (unlike the enclosing `Cow<'_,str>`).
Moreover, `code` takes ownership of `new_code: String` using the `Cow::from()` function.
No additional heap allocation is performed, and the string is not copied.
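
The same pattern can be reproduced without the `regex` crate. In the sketch below, `step` is a hypothetical rewriting function that, like `replace_all`, borrows its input when nothing changes and returns an owned value otherwise; `rewrite_to_fixpoint` then loops with `while let Cow::Owned(..)`.

```rust
use std::borrow::Cow;

/// Rewrites the first "ab" into "b"; borrows the input when nothing changes.
fn step(input: &str) -> Cow<'_, str> {
    if input.contains("ab") {
        Cow::Owned(input.replacen("ab", "b", 1))
    } else {
        Cow::Borrowed(input)
    }
}

/// Applies `step` until it stops producing owned values,
/// returning the final text and the number of rewrites performed.
fn rewrite_to_fixpoint(start: &str) -> (String, u32) {
    let mut code = start.to_string();
    let mut steps = 0;
    // Only the owned `String` escapes the match, so `code` can be
    // reassigned even though `step` borrowed it.
    while let Cow::Owned(next) = step(&code) {
        code = next;
        steps += 1;
    }
    (code, steps)
}
```

Here `rewrite_to_fixpoint("aaab")` performs three rewrites and ends with `"b"`, while an input without `"ab"` is returned unchanged after zero rewrites.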
Finally, we write the code to file
let file = base
.clone()
.unwrap_or(PathBuf::from(BASE))
.join(path);
write_to_file(file, &code).unwrap();
Additional details
Code indentation
When (positive) code indentation is required, the processed block of code is indented by `indent` spaces.
let prefix = format!("{:indent$}", "");
Each line is then prefixed separately and the result is returned.
fn indent<'a>(
    input: Cow<'a,str>,
    indent: usize
) -> Cow<'a,str> {
    if indent > 0 {
        <<indent_prefix>>
        let size = input.len() + indent*input.lines().count();
        let mut output = String::with_capacity(size);
        input.lines().enumerate().for_each(|(i,line)| {
            if i > 0 {
                output.push('\n');
            }
            if !line.is_empty() {
                output.push_str(&prefix);
                output.push_str(line);
            }
        });
        Cow::Owned(output)
    } else {
        input
    }
}
Note that, if no indentation is required (i.e., `indent` is equal to 0), no additional allocation is performed, and the `input` is returned as is.
RegEx matching
`pangler` uses the `regex` library to perform regular expression matching and substitution.
Moreover, the library suggests the use of `lazy_static` to ensure that the regexes used are compiled exactly once per execution.
lazy_static = "1.4"
regex = "1.5"
use lazy_static::lazy_static;
use regex::{Captures,Regex};
We wrap the regex definitions in a `lazy_static` macro
lazy_static! {
<<regex_path>>
}
lazy_static! {
<<regex_macro>>
}
Writing to file
Writing to file is an operation performed using the Rust support for OS operations from the standard library.
use std::fs;
use std::io::Result;
use std::path::PathBuf;
First, all necessary parent directories of `path` are created
fs::create_dir_all(path.parent().unwrap())?;
and then the `content` is written to the file at the provided path
fs::write(path, content)?;
We perform a check on `path` and only write the content to the file if the path is relative; absolute paths are reported on stderr and skipped.
fn write_to_file(
    path: PathBuf, content: &str
) -> std::io::Result<()> {
    if path.is_relative() {
        <<parent_directory_creation>>
        <<write_to_file>>
    } else {
        eprintln!(
            "Absolute paths not supported: {}",
            path.display()
        )
    }
    Ok(())
}
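
How the base directory and the relative-path check interact can be seen with a pure function that only computes the destination, without touching the file system. This is a sketch: `resolve` is a hypothetical helper, and the absolute-path cases assume Unix-style paths.

```rust
use std::path::{Path, PathBuf};

/// Computes where an entry point would be written, or `None` when the
/// resulting path is absolute and would therefore be skipped.
fn resolve(base: Option<&Path>, entry: &Path) -> Option<PathBuf> {
    // Default base directory, mirroring the `BASE` constant.
    let file = base.unwrap_or(Path::new("./")).join(entry);
    if file.is_relative() { Some(file) } else { None }
}
```

Because `Path::join` replaces the base when its argument is absolute, an absolute entry point is rejected, and so is any entry point combined with an absolute base directory.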
Credits
`pangler` was created by Federico Igne (git@federicoigne.com) and is available at https://git.dyamon.me/projects/pangler.
[package]
name = "pangler"
version = "0.4.0"
edition = "2021"
[dependencies]
<<dependencies>>