---
title: "Pangler: literate programming in Pandoc"
author: Federico Igne
date: \today
...

This document describes the logic and design of `pangler`, a minimal tangler for literate programming using the [Pandoc Markdown syntax](https://pandoc.org/MANUAL.html#pandocs-markdown).

[Literate Programming](https://en.wikipedia.org/wiki/Literate_programming) (LP) is a programming paradigm that emphasizes the natural flow of thoughts that the programmer experiences when writing software.
The paradigm can be seen as "documentation first" and the focus is on "human-to-human" communication.
The produced document is a text-based prose document describing the logic and the design of the program, interspersed with snippets of code that form the final software.

Given an LP document, one can either extract the *tangled* code (with a "tangler") or generate its documentation, "woven" from the literate input source (with a "weaver").
In this case, [Pandoc](https://pandoc.org) is a very good weaver, supporting the generation of different document formats from a Markdown source.

This document is an attempt at providing a tangler working alongside Pandoc.
`pangler` is itself written in Pandoc Markdown format and can be generated from this document using itself.

# Literate programming with `pangler`

`pangler` uses two main features provided by the Pandoc Markdown syntax, which are not necessarily present in other Markdown flavours:

1. [`backtick_code_blocks`](https://pandoc.org/MANUAL.html#extension-backtick_code_blocks) for writing fenced snippets of code, and
2. [`fenced_code_attributes`](https://pandoc.org/MANUAL.html#extension-fenced_code_attributes) for adding arbitrary HTML attributes, classes and IDs to a snippet of code.

## Writing programs

In the following, a *literate program* is a Markdown file written in Pandoc Markdown syntax.
A minimal block of code, recognized by `pangler`, with ID `identifier` is

~~~
```{#identifier}
[code snippet]
```
~~~

Code blocks can contain **code macros** of the form `<<identifier>>` where `identifier` is a valid code block ID.
Code macros will be recursively substituted by the corresponding code snippet during [code generation](#tangling-generating-the-source-files).
A code macro needs to be placed on its own line, with an optional (whitespace) indentation, which is used during code generation to indent the code snippet.

Additional attributes and classes can be added to a code block as well; the language of the code snippet can be provided and is useful to enable correct syntax highlighting.

~~~
```{#identifier .python}
[python code snippet]
```
~~~

An identifier can also be a file name matching the following regex

```{#regex_path .rust}
static ref PATH: Regex = Regex::new(
    r"^(?:[[:word:]\.-]+/)*[[:word:]\.-]+\.[[:alpha:]]+$"
).unwrap();
```

In that case the code block is considered a valid **entry point** for the generation of a file with that name.
The code block defines the content of the new file.

~~~
```{#file.py .python}
[python main file]
```
~~~

Files can be generated in subfolders using the `path` attribute.
The following code block determines the content of file `path/to/file.py`.

~~~
```{#file.py .python path="path/to/"}
[python main file]
```
~~~

This path is relative to the current working directory, unless [the `-o`/`--output` flag is used](#command-line-interface).

Code blocks without an ID are ignored.

```{#code_block_gathering .rust}
if !id.is_empty() {
    let key = {
        let path = attrs.iter().find(|(k,_)| k == "path");
        if let Some(path) = path {
            format!("{}{}", path.1, id)
        } else {
            id.to_string()
        }
    };
    <>
} else {
    eprintln!("Ignoring code block without ID:");
    eprintln!("{}", indent(Cow::from(code),4));
}
```

Code blocks are processed in order.
By default, if an identifier is already defined, the code block is appended to the current corresponding value.
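The keying and appending behaviour described above can be sketched with the standard library alone. This is a minimal illustration rather than `pangler`'s own code: the `register` helper and the plain `String` values are assumptions made for the example. The block carries no ID, so `pangler` ignores it when tangling.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of `pangler`): stores `code` under `id`,
/// prefixed by the optional `path` attribute.
fn register(
    blocks: &mut HashMap<String, String>,
    id: &str,
    path: Option<&str>,
    code: &str,
) {
    // The key is the ID, prefixed by the `path` attribute when present.
    let key = match path {
        Some(p) => format!("{}{}", p, id),
        None => id.to_string(),
    };
    // An existing entry is extended with a newline and the new snippet;
    // otherwise a fresh entry is created.
    blocks
        .entry(key)
        .and_modify(|existing| {
            existing.push('\n');
            existing.push_str(code);
        })
        .or_insert_with(|| code.to_string());
}
```

Registering `file.py` twice with `path="path/to/"` leaves a single entry keyed `path/to/file.py`, whose value is the two snippets joined by a newline.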
Use the `override` class in the code block definition to cause the block to override the previous entry with the same key, if this exists.

~~~
```{#identifier .python .override}
[Python code snippet]
```
~~~

This is handled in code as follows

```{#code_block .rust}
if clss.iter().any(|c| c == "override") {
    blocks.insert(key, Cow::from(code));
} else {
    blocks.entry(key)
        .and_modify(|s| { *s += "\n"; *s += Cow::from(code) })
        .or_insert(Cow::from(code));
}
```

## Tangling: generating the source files

To bootstrap the tangling process, a tangled version of the program is provided alongside the literate version.
The executable can be compiled from the root of the project with

```sh
cargo build --release
```

From now on you can make changes to the `README.md` file and use your latest compiled version of `pangler` to tangle and compile it.

## Weaving: generating the documentation

As explained above we use [`pandoc`](https://pandoc.org/) as a weaver.
Run the following command to generate a PDF file for this document

```sh
pandoc --to latex \
    --listings \
    --number-sections \
    --lua-filter=util/weaver.lua \
    --output pangler.pdf \
    README.md
```

The Lua filter `util/weaver.lua` is provided to handle custom `pangler` attributes during the PDF generation via the \LaTeX\ engine.

## Integration with (Neo)Vim

(Neo)Vim supports code highlighting inside Markdown blocks, when the programming language is provided among its attributes.
Add the following to your config file to enable code highlighting for a specific set of languages

```vimscript
let g:markdown_fenced_languages = ['python','rust','scala']
```

# Command Line Interface

`pangler` offers a very simple command line interface.
For an overview of the functionalities offered by the tool run

```sh
pangler --help
```

`pangler` uses the `clap` library to parse command line arguments

```{#dependencies .toml}
clap = { version = "3.1", features = ["derive"] }
```

```{#uses .rust}
use clap::Parser;
```

using the [Derive API](https://github.com/clap-rs/clap/blob/v3.1.18/examples/tutorial_derive/README.md) to define the exposed functionalities.
The `struct` holding the CLI information is defined as follows

```{#config .rust}
/// A tangler for Literate Programming in Pandoc
#[derive(Parser, Debug)]
#[clap(author, version, about, long_about = None)]
struct Config {
    <>
    <>
    <>
}
```

and the arguments are parsed as

```{#config_parse .rust}
let config = Config::parse();
```

`pangler` accepts a sequence of files; these are parsed, and their code is collected and used to build the final program.
Note that the order of the files provided on the CLI is important when using the [overriding functionality](#writing-programs).

```{#config_input .rust}
/// Input files
input: Vec<PathBuf>,
```

By default, files are generated in the current working directory.

```{#constants .rust}
const BASE: &str = "./";
```

This behaviour can be overridden using the `-o`/`--output` flag.

```{#config_output .rust}
/// Base output directory [default: './']
#[clap(short, long)]
output: Option<PathBuf>,
```

Finally, recursive substitution of blocks can lead to an infinite loop.
By default, `pangler` will stop after 10 substitution iterations, but this parameter can be changed with the `-d`/`--depth` flag.
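To see why a bound is needed, consider a block that refers to itself. The following is a minimal, standard-library-only sketch, not `pangler`'s implementation. `expand_once` and `expand` are hypothetical names, a macro is recognised only when it is alone on its line, indentation is discarded, and unknown identifiers expand to nothing. Once the bound is reached, remaining macros are replaced by the empty string, so the loop always terminates.

```rust
use std::collections::HashMap;

/// One substitution pass over `code` (hypothetical helper).
/// A line consisting only of `<<id>>` is replaced by the block named `id`,
/// or by an empty string when `truncate` is set.
/// Returns `None` when no macro line was found.
fn expand_once(
    code: &str,
    blocks: &HashMap<&str, &str>,
    truncate: bool,
) -> Option<String> {
    let mut changed = false;
    let mut out: Vec<&str> = Vec::new();
    for line in code.lines() {
        let id = line
            .trim()
            .strip_prefix("<<")
            .and_then(|rest| rest.strip_suffix(">>"));
        match id {
            Some(id) => {
                changed = true;
                // Assumption: unknown IDs expand to nothing.
                let block = blocks.get(id).copied().unwrap_or("");
                out.push(if truncate { "" } else { block });
            }
            None => out.push(line),
        }
    }
    if changed { Some(out.join("\n")) } else { None }
}

/// Repeats `expand_once` until no macro is left, truncating once
/// `max_depth` passes have been performed. Returns the expanded text
/// and the number of passes.
fn expand(
    code: &str,
    blocks: &HashMap<&str, &str>,
    max_depth: u32,
) -> (String, u32) {
    let mut code = code.to_string();
    let mut depth = 0;
    while let Some(next) = expand_once(&code, blocks, depth >= max_depth) {
        code = next;
        depth += 1;
    }
    (code, depth)
}
```

With a block `loop` whose content is `<<loop>>` and `max_depth` set to 3, `expand` performs three expansions and one truncating pass, then returns an empty string instead of looping forever.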
```{#config_depth .rust}
/// Maximum substitution depth
#[clap(short, long, default_value_t = 10)]
depth: u32,
```

# The program

The program is structured as a single Rust file with the following being the main entry point of the program

```{#main.rs .rust path="src/"}
<>
<>
<>
<>
<>

fn main() -> Result<()> {
    <>
    <>
    Ok(())
}
```

## Pandoc

We are using [`rust-pandoc`](https://github.com/oli-obk/rust-pandoc) and [`pandoc-ast`](https://github.com/oli-obk/pandoc-ast) to interact with `pandoc` from Rust.

```{#dependencies .toml}
pandoc = "0.8"
pandoc_ast = "0.8"
```

```{#uses .rust}
use pandoc::{
    InputFormat,InputKind,OutputFormat,OutputKind,Pandoc
};
use pandoc_ast::Block;
```

First we need to initialize a new `Pandoc` struct

```{#pandoc_setup .rust}
let mut pandoc = Pandoc::new();
```

and set up the input parameters.
The input is a sequence of Markdown files passed as config options from the CLI.

```{#pandoc_setup .rust}
pandoc.set_input(InputKind::Files(config.input));
pandoc.set_input_format(InputFormat::Markdown, vec![]);
```

The output is piped to stdout in JSON format.

```{#pandoc_setup .rust}
pandoc.set_output(OutputKind::Pipe);
pandoc.set_output_format(OutputFormat::Json, vec![]);
```

In this way, we will be able to pipe the output into a Pandoc filter that will collect the code snippets and build the codebase for us.

```{#pandoc_setup .rust}
pandoc.add_filter(
    move |json| pandoc_ast::filter(json, |pandoc| {
        <>
    })
);
pandoc.execute().unwrap();
```

## Pandoc filters

Pandoc allows for the definition of [custom filters](https://pandoc.org/filters.html) to change the abstract syntax tree of a document.
In this case we use a filter to collect code snippets from the input Markdown text into a `HashMap`, mapping code block identifiers to code block snippets.
```{#uses .rust}
use std::borrow::Cow;
use std::collections::HashMap;
```

```{#types .rust}
type Blocks<'a> = HashMap<String, Cow<'a, str>>;
```

Code blocks are wrapped into a [`Cow`](https://doc.rust-lang.org/stable/std/borrow/enum.Cow.html), i.e., a "copy-on-write" smart pointer, to avoid string duplication unless strictly necessary.

We iterate over all code blocks, along with their IDs, classes and attributes, collecting them

```{#pandoc_filter .rust}
let mut blocks: Blocks = HashMap::new();
pandoc.blocks.iter().for_each(|block|
    if let Block::CodeBlock((id,clss,attrs), code) = block {
        <>
    }
);
```

And then we build the source code, making sure to cut off recursive code generation with depth larger than `config.depth`.

```{#pandoc_filter .rust}
build(&config.output, &blocks, config.depth);
```

The filter returns the Pandoc JSON unchanged.

```{#pandoc_filter .rust}
pandoc
```

## Source code generation

In order to build the source code from the gathered code block snippets, we need to recursively substitute *code macros* of the form `<<identifier>>` with the corresponding code block.
Code macros are matched with the following regex

```{#regex_macro .rust}
static ref MACRO: Regex = Regex::new(
    r"(?m)^([[:blank:]]*)<<([^>\s]+)>>"
).unwrap();
```

Note that, when matching the code macro, we keep track of its indentation as well, in order to properly indent the code.

Given a code macro, the following closure will compute the substituting block of code, properly indented.
The input `Captures` structure is a vector with the regex capture groups, i.e., indentation and macro identifier, along with the full match in the first position.
If the maximum allowed depth is reached, we truncate code block substitution and notify the user that something might not have been generated as expected.
```{#macro_closure .rust}
|caps: &Captures| {
    if current_depth < max_depth {
        let block = blocks
            .get(&caps[2])
            .expect("Block not present")
            .clone();
        indent(block, caps[1].len())
    } else {
        eprintln!("Reached maximum depth, \
                   output might be truncated.\n\
                   Increase `--depth` accordingly.");
        Cow::Owned(String::from(""))
    }
}
```

As explained above, the building process iterates over all collected blocks and detects relevant entry points (files to generate) to start the recursive macro substitution.

```{#functions .rust}
fn build(
    base: &Option<PathBuf>,
    blocks: &Blocks,
    max_depth: u32
) {
    <>
    blocks
        .iter()
        .for_each(|(path,code)| if PATH.is_match(path) {
            <>
        })
}
```

### Recursive macro substitution

The code generation algorithm went through multiple iterations and showed some interesting details of using `Cow`s.

```{#code_generation .rust}
let mut current_depth = 0;
let mut code = code.clone();
while MACRO.is_match(&code) {
    code = MACRO.replace_all(
        &code,
        <>
    );
    current_depth += 1;
}
```

The problem with this version is that, due to how `Cow` works, the value returned by `replace_all` cannot live longer than the borrowed `code` passed as a parameter.
This is because the function returns a reference to `code` (`Cow::Borrowed`) if no replacement takes place, so for the returned value to be valid, `code` still needs to be available.
But here, `code` gets overridden right away, so, in principle, if no replacement takes place `code` gets overridden by a reference to itself (losing data).

However, note that this doesn't happen in practice (but the compiler doesn't know about this) because the `replace_all` function is applied only as long as some replacement is possible (`while` condition).
In other words, all calls to `replace_all` always return a `Cow::Owned` value.
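The borrowed and owned cases can be reproduced without the `regex` crate. The sketch below is only an illustration: `replace_at` is a hypothetical stand-in for `replace_all` that mimics its return convention, borrowing the input when nothing changes and allocating a new `String` otherwise.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for `Regex::replace_all`: replaces every `@`
/// with `at`. The returned `Cow` shares the lifetime of `input`, because
/// the unchanged case hands back a reference to it.
fn replace_at(input: &str) -> Cow<'_, str> {
    if input.contains('@') {
        // Something was replaced: a freshly allocated, independent string.
        Cow::Owned(input.replace('@', "at"))
    } else {
        // Nothing to replace: no allocation, just a borrow of `input`.
        Cow::Borrowed(input)
    }
}
```

Here `replace_at("a@b")` yields a `Cow::Owned` value, while `replace_at("ab")` yields `Cow::Borrowed`. The signature cannot tell the two apart, so the borrow checker has to assume the result may still point into `input`.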
The problem is solved by a clever use of pattern matching

```{#code_generation .rust .override}
let mut current_depth = 0;
let mut code = code.clone();
while let Cow::Owned(new_code) = MACRO.replace_all(
    &code,
    <>
) {
    code = Cow::from(new_code);
    current_depth += 1;
}
```

In this case, the matched `Cow::Owned` value is not tied to the lifetime of the borrowed value `code` (the type is `Cow<'_,str>`).
Moreover, `code` takes ownership of `new_code: String` using the `Cow::from()` function.
No heap allocation is performed, and the string is not copied.

Finally, we write the code to file

```{#code_generation .rust}
let file = base
    .clone()
    .unwrap_or(PathBuf::from(BASE))
    .join(path);
write_to_file(file, &code)
    .expect("Unable to write to file");
```

## Additional details

### Code indentation

When (positive) code indentation is required, the processed block of code is indented by `indent`.

```{#indent_prefix .rust}
let prefix = format!("{:indent$}", "");
```

Each line is then `prefix`ed separately and the result is returned.

```{#functions .rust}
fn indent<'a>(
    input: Cow<'a,str>,
    indent: usize
) -> Cow<'a,str> {
    if indent > 0 {
        <>
        let size = input.len() + indent*input.lines().count();
        let mut output = String::with_capacity(size);
        input.lines().enumerate().for_each(|(i,line)| {
            if i > 0 {
                output.push('\n');
            }
            if !line.is_empty() {
                output.push_str(&prefix);
                output.push_str(line);
            }
        });
        Cow::Owned(output)
    } else {
        input
    }
}
```

Note that, if no indentation is required (i.e., `indent` is equal to 0), no additional allocation is performed, and the `input` is returned as is.

### RegEx matching

`pangler` uses the `regex` library to perform regular expression matching and substitution.
Moreover, the library suggests the use of `lazy_static` to ensure that the regexes used are compiled exactly once per execution.
```{#dependencies .toml}
lazy_static = "1.4"
regex = "1.5"
```

```{#uses .rust}
use lazy_static::lazy_static;
use regex::{Captures,Regex};
```

We wrap the regex definition in a `lazy_static` macro

```{#regex_definition .rust}
lazy_static! {
    <>
    <>
}
```

### Writing to file

Writing to file is performed using the Rust support for OS operations from the standard library.

```{#uses .rust}
use std::fs;
use std::io::Result;
use std::path::PathBuf;
```

First, all necessary parent directories of `path` are created

```{#parent_directory_creation .rust}
fs::create_dir_all(path.parent().unwrap())?;
```

and then `content` is written to the file at `path`

```{#write_to_file .rust}
fs::write(path, content)?;
```

We perform a check on `path` and only write the content to the file if the path is relative to the current working directory.

```{#functions .rust}
fn write_to_file(
    path: PathBuf,
    content: &str
) -> std::io::Result<()> {
    if path.is_relative() {
        <>
        <>
    } else {
        eprintln!(
            "Absolute paths not supported: {}",
            path.to_string_lossy()
        )
    }
    Ok(())
}
```

# Credits

`pangler` was created by Federico Igne (git@federicoigne.com) and is available at [`https://git.dyamon.me/projects/pangler`](https://git.dyamon.me/projects/pangler).

```{#Cargo.toml .toml}
[package]
name = "pangler"
version = "0.3.0"
edition = "2021"

[dependencies]
<>
```