feat: add literate version of pangler

author: Federico Igne <git@federicoigne.com> 2022-06-08 20:58:43 +0100
committer: Federico Igne <git@federicoigne.com> 2022-06-08 20:58:43 +0100
commit: 7fb232b502e0ad06c139b64c1f2d541b79ab96df (patch)
tree: 9bad58e23b58101562f641c7ed37c77ce516a90b
parent: 24a2f4c09901863a3d4fbbda7f85eaebbf29c95f (diff)
download: pangler-7fb232b502e0ad06c139b64c1f2d541b79ab96df.tar.gz
pangler-7fb232b502e0ad06c139b64c1f2d541b79ab96df.zip
2 files changed, 612 insertions, 0 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..28a9eed
--- /dev/null
+++ b/README.md
@@ -0,0 +1,578 @@
+---
+title: "Pangler: literate programming in Pandoc"
+author: Federico Igne
+date: \today
+---
+This document describes the logic and design of `pangler`, a minimal tangler for literate programming using the [Pandoc Markdown syntax](https://pandoc.org/MANUAL.html#pandocs-markdown).
+[Literate Programming](https://en.wikipedia.org/wiki/Literate_programming) (LP) is a programming paradigm that emphasizes the natural flow of thoughts that the programmer experiences when writing software.
+The paradigm can be seen as "documentation first" and the focus is on "human-to-human" communication.
+The produced document is a text-based prose document describing the logic and the design of the program, interspersed with snippets of code that form the final software.
+Given an LP document, one can either extract the *tangled* code (with a "tangler") or generate its documentation, "woven" from the literate input source (with a "weaver").
+In this case, [Pandoc](https://pandoc.org) is a very good weaver, supporting the generation of different document formats from a Markdown source.
+This document is an attempt at providing a tangler working alongside Pandoc.
+`pangler` is itself written in Pandoc Markdown format and can be generated from this document using itself.
+# Literate programming with `pangler`
+`pangler` uses two main features provided by the Pandoc Markdown syntax, which are not necessarily present in other Markdown flavours:
+1. [`backtick_code_blocks`](https://pandoc.org/MANUAL.html#extension-backtick_code_blocks) for writing fenced snippets of code, and
+2. [`fenced_code_attributes`](https://pandoc.org/MANUAL.html#extension-fenced_code_attributes) for adding arbitrary HTML attributes, classes, and an ID to a snippet of code.
+## Writing programs
+In the following, a *literate program* is a Markdown file written in Pandoc Markdown syntax.
+A minimal block of code, recognized by `pangler`, with ID `identifier` is
+~~~
+```{#identifier}
+[code snippet]
+```
+~~~
+Code blocks can contain **code macros** of the form `<<identifier>>` where `identifier` is a valid code block ID.
+Code macros will be recursively substituted by the corresponding code snippet during [code generation][Tangling: generating the source files].
+A code macro needs to be placed on its own line, optionally preceded by whitespace indentation, which is used during code generation to indent the substituted code snippet.
+Additional attributes and classes can be added to a code block, as well;
+the language of the code snippet can be provided and is useful to enable correct syntax highlighting.
+~~~
+```{#identifier .python}
+[python code snippet]
+```
+~~~
+An identifier can also be a file name matching the following regex
+```{#regex_path .rust}
+static ref PATH: Regex =
+  Regex::new(
+    r"^(?:[[:word:]\.-]+/)*[[:word:]\.-]+\.[[:alpha:]]+$"
+  ).unwrap();
+```
+In that case the code block is considered a valid **entry point** for the generation of a file with that name.
+The code block defines the content of the new file.
+~~~
+```{#file.py .python}
+[python main file]
+```
+~~~
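Note: to make the entry-point rule concrete, here is a minimal std-only sketch that approximates the `PATH` regex without the `regex` crate. The helper name `is_entry_point` is hypothetical and not part of `pangler`. It assumes that `[[:word:]]` and `[[:alpha:]]` are the ASCII classes.

```rust
/// Std-only approximation of the `PATH` check (hypothetical helper,
/// not part of pangler): every component is made of ASCII word
/// characters, '.' or '-', and the file name ends with an alphabetic
/// extension.
fn is_entry_point(id: &str) -> bool {
    // A component must be non-empty and use only the allowed characters.
    fn allowed(s: &str) -> bool {
        !s.is_empty()
            && s.chars()
                .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '.' || c == '-')
    }
    let mut parts: Vec<&str> = id.split('/').collect();
    // `split` always yields at least one item, so the fallback is never used.
    let file = parts.pop().unwrap_or("");
    // The file name needs a non-empty stem and a purely alphabetic extension.
    let has_extension = match file.rsplit_once('.') {
        Some((stem, ext)) => {
            allowed(stem) && !ext.is_empty() && ext.chars().all(|c| c.is_ascii_alphabetic())
        }
        None => false,
    };
    has_extension && parts.iter().all(|dir| allowed(dir))
}
```

Under these assumptions, `file.py` and `src/main.rs` qualify as entry points, while a plain identifier such as `regex_path` or an absolute path such as `/file.py` does not.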
+File names can be generated in subfolders using the `path` attribute.
+The following code block determines the content of file `path/to/file.py`.
+~~~
+```{#file.py .python path="path/to/"}
+[python main file]
+```
+~~~
+This path is relative to the current working directory, unless [the `-o`/`--output` flag is used][Command Line Interface].
+Code blocks without an ID are ignored.
+```{#code_block_gathering .rust}
+if !id.is_empty() {
+  let key = {
+    let path = attrs.iter().find(|(k,_)| k == "path");
+    if let Some(path) = path {
+      format!("{}{}", path.1, id) 
+    } else {
+      id.to_string()
+    }
+  };
+  <<code_block>>
+} else {
+  eprintln!("Ignoring code block without ID:");
+  eprintln!("{}", indent(Cow::from(code),4));
+}
+```
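Note: the key computation above can be tried in isolation. The following is a minimal sketch with a hypothetical helper, `block_key`, that mirrors the `key` expression. It takes the attributes as a slice of `(String, String)` pairs, which is an assumption about how they are represented.

```rust
/// Builds the lookup key for a code block (hypothetical helper that
/// mirrors the `key` computation above). Returns `None` when the block
/// has no ID, i.e. when it would be ignored.
fn block_key(id: &str, attrs: &[(String, String)]) -> Option<String> {
    if id.is_empty() {
        return None;
    }
    let path = attrs.iter().find(|(k, _)| k == "path");
    Some(match path {
        // Plain string concatenation: no separator is inserted.
        Some((_, dir)) => format!("{}{}", dir, id),
        None => id.to_string(),
    })
}
```

Because the `path` value and the ID are simply concatenated, the attribute is expected to end with `/`: in this sketch, `path="path/to"` would yield the key `path/tofile.py`.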
+Code blocks are processed in order.
+By default, if an identifier is already defined, the code block is appended to the current corresponding value.
+Use the `override` class in the code block definition to cause the block to override the previous entry with the same key, if this exists.
+~~~
+```{#identifier .python .override}
+[Python code snippet]
+```
+~~~
+This is handled in code as follows
+```{#code_block .rust}
+if clss.iter().any(|c| c == "override") {
+  blocks.insert(key, Cow::from(code));
+} else {
+  blocks.entry(key)
+        .and_modify(|s| {
+          *s += "\n";
+          *s += Cow::from(code)
+        })
+        .or_insert(Cow::from(code));
+}
+```
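Note: the append-versus-override behaviour can be sketched with owned `String` values instead of `Cow`. The helper `add_block` and its `replace` parameter are hypothetical names used only for illustration.

```rust
use std::collections::HashMap;

/// Registers a code block under `key` (hypothetical helper). With
/// `replace` set, the previous entry is discarded; otherwise the new
/// code is appended on a new line.
fn add_block(blocks: &mut HashMap<String, String>, key: &str, code: &str, replace: bool) {
    if replace {
        blocks.insert(key.to_string(), code.to_string());
    } else {
        blocks
            .entry(key.to_string())
            .and_modify(|s| {
                s.push('\n');
                s.push_str(code);
            })
            .or_insert_with(|| code.to_string());
    }
}
```

Adding two blocks with the same key and `replace == false` joins them with a newline, whereas a later block with `replace == true` leaves only its own code under that key.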
+## Tangling: generating the source files
+To bootstrap the tangling process, an early version of `pangler` is provided under `bin/` in this repository.
+You can generate the code for the current version of the program, in the current working directory, with
+```sh
+./bin/pangler-v0.1.0 README.md
+```
+and compile it with
+```sh
+cargo build --release
+```
+From now on you can make changes to the `README.md` file and use the latest version of `pangler` to tangle and compile it.
+## Weaving: generating the documentation
+As explained above we use [`pandoc`](https://pandoc.org/) as a weaver.
+Run the following command to generate a PDF file for this document
+```sh
+pandoc --to latex \
+       --listings \
+       --number-sections \
+       --lua-filter=util/weaver.lua \
+       --output pangler.pdf \
+       README.md
+```
+The Lua filter `util/weaver.lua` is provided to handle custom `pangler` attributes during the PDF generation via the \LaTeX\ engine.
+## Integration with (Neo)Vim
+(Neo)Vim supports code highlighting inside Markdown blocks, when the programming language is provided among its attributes.
+Add the following to your config file to enable code highlighting for a specific set of languages
+```vimscript
+let g:markdown_fenced_languages =
+  ['python','rust','scala']
+```
+# Command Line Interface
+`pangler` offers a very simple command line interface.
+For an overview of the functionalities offered by the tool run
+```sh
+pangler --help
+```
+`pangler` uses the `clap` library to parse command line arguments
+```{#dependencies .toml}
+clap = { version = "3.1", features = ["derive"] }
+```
+```{#uses .rust}
+use clap::Parser;
+```
+using the [Derive API](https://github.com/clap-rs/clap/blob/v3.1.18/examples/tutorial_derive/README.md) to define the exposed functionalities.
+The `struct` holding the CLI information is defined as follows
+```{#config .rust}
+/// A tangler for Literate Programming in Pandoc
+#[derive(Parser, Debug)]
+#[clap(author, version, about, long_about = None)]
+struct Config {
+  <<config_depth>>
+  <<config_output>>
+  <<config_input>>
+}
+```
+and the arguments are parsed as
+```{#config_parse .rust}
+let config = Config::parse();
+```
+`pangler` accepts a sequence of files; they are parsed in order, and the collected code is used to build the final program.
+Note that the order of the files provided on the CLI matters when using the [overriding functionality][Writing programs].
+```{#config_input .rust}
+/// Input files
+input: Vec<PathBuf>,
+```
+By default, files are generated in the current working directory.
+```{#constants .rust}
+const BASE: &str = "./";
+```
+This behaviour can be overridden using the `-o`/`--output` flag.
+```{#config_output .rust}
+/// Base output directory [default: './']
+#[clap(short, long)]
+output: Option<PathBuf>,
+```
+Finally, recursive substitution of blocks can lead to an infinite loop.
+By default, `pangler` will stop after 10 substitution iterations, but this parameter can be changed with the `-d`/`--depth` flag.
+```{#config_depth .rust}
+/// Maximum substitution depth
+#[clap(short, long, default_value_t = 10)]
+depth: u32,
+```
+# The program
+The program is structured as a single Rust file, whose main entry point is the following
+```{#main.rs .rust path="src/"}
+<<uses>>
+<<constants>>
+<<config>>
+<<types>>
+<<functions>>
+fn main() -> Result<()> {
+  <<config_parse>>
+  <<pandoc_setup>>
+  Ok(())
+}
+```
+## Pandoc
+We are using [`rust-pandoc`](https://github.com/oli-obk/rust-pandoc) and [`pandoc-ast`](https://github.com/oli-obk/pandoc-ast) to interact with `pandoc` from Rust.
+```{#dependencies .toml}
+pandoc = "0.8"
+pandoc_ast = "0.8"
+```
+```{#uses .rust}
+use pandoc::{
+  InputFormat,InputKind,OutputFormat,OutputKind,Pandoc
+};
+use pandoc_ast::Block;
+```
+First we need to initialize a new `Pandoc` struct
+```{#pandoc_setup .rust}
+let mut pandoc = Pandoc::new();
+```
+and set up the input parameters.
+The input is a sequence of Markdown files passed as config options from the CLI.
+```{#pandoc_setup .rust}
+pandoc.set_input(InputKind::Files(config.input));
+pandoc.set_input_format(InputFormat::Markdown, vec![]);
+```
+The output is piped to stdout in JSON format.
+```{#pandoc_setup .rust}
+pandoc.set_output(OutputKind::Pipe);
+pandoc.set_output_format(OutputFormat::Json, vec![]);
+```
+In this way, we will be able to pipe the output into a Pandoc filter that will collect the code snippets and build the codebase for us.
+```{#pandoc_setup .rust}
+pandoc.add_filter(
+  move |json| pandoc_ast::filter(json,
+    |pandoc| {
+      <<pandoc_filter>>
+    }
+  )
+);
+pandoc.execute().unwrap();
+```
+## Pandoc filters
+Pandoc allows for the definition of [custom filters](https://pandoc.org/filters.html) to change the abstract syntax tree of a document.
+In this case we use a filter to collect code snippets from the input Markdown text into a `HashMap`, mapping code block identifiers to code block snippets.
+```{#uses .rust}
+use std::borrow::Cow;
+use std::collections::HashMap;
+```
+```{#types .rust}
+type Blocks<'a> = HashMap<String,Cow<'a,str>>;
+```
+Code blocks are wrapped into a [`Cow`](https://doc.rust-lang.org/stable/std/borrow/enum.Cow.html), i.e., a "copy-on-write" smart pointer, to avoid string duplication, unless strictly necessary.
+We iterate over all code blocks, along with their IDs, classes and attributes, collecting them
+```{#pandoc_filter .rust}
+let mut blocks: Blocks = HashMap::new();
+pandoc.blocks.iter().for_each(|block|
+  if let Block::CodeBlock((id,clss,attrs), code) = block {
+    <<code_block_gathering>>
+  }
+);
+```
+And then we build the source code, making sure to cut off recursive code generation with depth larger than `config.depth`.
+```{#pandoc_filter .rust}
+build(&config.output, &blocks, config.depth);
+```
+The filter returns the Pandoc JSON unchanged.
+```{#pandoc_filter .rust}
+pandoc
+```
+## Source code generation
+In order to build the source code from the gathered code block snippets, we need to recursively substitute *code macros* of the form `<<identifier>>` with the corresponding code block.
+Code macros are matched with the following regex
+```{#regex_macro .rust}
+static ref MACRO: Regex =
+  Regex::new(
+    r"(?m)^([[:blank:]]*)<<([^>\s]+)>>"
+  ).unwrap();
+```
+Note that, when matching the code macro, we keep track of its indentation as well, in order to properly indent the substituted code.
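Note: the two capture groups can be illustrated with a std-only approximation of the `MACRO` regex applied to a single line. The helper `parse_macro` is hypothetical and is not how `pangler` matches macros.

```rust
/// Std-only approximation of the `MACRO` regex for one line
/// (hypothetical helper). Returns the indentation and the identifier,
/// i.e. the two capture groups of the regex.
fn parse_macro(line: &str) -> Option<(&str, &str)> {
    // `[[:blank:]]*`: leading spaces and tabs.
    let body = line.trim_start_matches(|c: char| c == ' ' || c == '\t');
    let indent = &line[..line.len() - body.len()];
    let rest = body.strip_prefix("<<")?;
    // `[^>\s]+`: the identifier stops at the first '>' or whitespace.
    let end = rest.find(|c: char| c == '>' || c.is_whitespace())?;
    if end == 0 || !rest[end..].starts_with(">>") {
        return None;
    }
    Some((indent, &rest[..end]))
}
```

Like the regex, this sketch does not inspect what follows the closing `>>`, so trailing text on the same line is not part of the match.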
+Given a code macro, the following closure will compute the substituting block of code, properly indented.
+The input `Captures` structure is a vector with the regex capture groups, i.e., indentation and macro identifier, along with the full match in the first position.
+In case we reach the maximum allowed depth we truncate code block substitution and notify the user that something might not have been generated as expected.
+```{#macro_closure .rust}
+|caps: &Captures| {
+  if current_depth < max_depth {
+    let block = blocks
+      .get(&caps[2])
+      .expect("Block not present")
+      .clone();
+    indent(block, caps[1].len())
+  } else {
+    eprintln!("Reached maximum depth, \
+               output might be truncated.\n\
+               Increase `--depth` accordingly.");
+    Cow::Owned(String::from(""))
+  }
+}
+```
+As explained above, the building process iterates over all collected blocks and detects relevant entry points (files to generate) to start the recursive macro substitution.
+```{#functions .rust}
+fn build(
+  base: &Option<PathBuf>,
+  blocks: &Blocks,
+  max_depth: u32
+) {
+  <<regex_definition>>
+  blocks
+    .iter()
+    .for_each(|(path,code)| if PATH.is_match(path) { 
+      <<code_generation>>
+    })
+}
+```
+### Recursive macro substitution
+The code generating algorithm went through multiple iterations and showed some interesting details of using `Cow`s.
+```{#code_generation .rust}
+let mut current_depth = 0;
+let mut code = code.clone();
+while MACRO.is_match(&code) {
+  code = MACRO.replace_all(
+    &code,
+    <<macro_closure>>
+  );
+  current_depth += 1;
+}
+```
+The problem with this version is that, due to how `Cow` works, the value returned by `replace_all` cannot live longer than the borrowed `code` passed as a parameter.
+This is because the function returns a reference to `code` (`Cow::Borrowed`) if no replacement takes place, so for the returned value to be valid, `code` still needs to be available.
+But here, `code` gets overwritten right away, so, in principle, if no replacement takes place `code` gets overwritten by a reference to itself (losing data).
+However, note that this doesn't happen in practice (but the compiler doesn't know about this) because the `replace_all` function is applied only as long as some replacement is possible (`while` condition).
+In other words, every call to `replace_all` inside the loop returns a `Cow::Owned` value.
+The problem is solved by a clever use of pattern matching
+```{#code_generation .rust .override}
+let mut current_depth = 0;
+let mut code = code.clone();
+while let Cow::Owned(new_code) = MACRO.replace_all(
+  &code,
+  <<macro_closure>>
+) {
+  code = Cow::from(new_code);
+  current_depth += 1;
+}
+```
+In this case, the `String` extracted from the matched `Cow::Owned` is not tied to the lifetime of the borrowed value `code` (the returned type is `Cow<'_,str>`).
+Moreover `code` takes ownership of `new_code: String` using the `Cow::from()` function.
+No heap allocation is performed, and the string is not copied.
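Note: the `while let` pattern can be reproduced without the `regex` crate. In the sketch below, `replace_first` is a hypothetical stand-in for `replace_all` that has the same borrowed-or-owned return behaviour, and `replace_until_stable` is a hypothetical driver. The sketch assumes that `to` does not contain `from`, since otherwise the loop would not terminate.

```rust
use std::borrow::Cow;

/// Stand-in for `Regex::replace_all` (hypothetical helper): replaces the
/// first occurrence of `from` and borrows the input when nothing changes.
fn replace_first<'a>(input: &'a str, from: &str, to: &str) -> Cow<'a, str> {
    if input.contains(from) {
        Cow::Owned(input.replacen(from, to, 1))
    } else {
        Cow::Borrowed(input)
    }
}

/// Applies `replace_first` until it reports that nothing was replaced.
/// Assumes `to` does not contain `from`, otherwise this never stops.
fn replace_until_stable(start: &str, from: &str, to: &str) -> (String, u32) {
    let mut rounds = 0;
    let mut code: Cow<str> = Cow::Borrowed(start);
    // Only the owned `String` is bound by the pattern, so `code` can be
    // reassigned inside the loop body.
    while let Cow::Owned(new_code) = replace_first(&code, from, to) {
        code = Cow::from(new_code);
        rounds += 1;
    }
    (code.into_owned(), rounds)
}
```

The loop ends on the first `Cow::Borrowed` result, which plays the same role as the `is_match` test in the earlier version.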
+Finally, we write the code to file
+```{#code_generation .rust}
+let file = base
+  .clone()
+  .unwrap_or(PathBuf::from(BASE))
+  .join(path);
+write_to_file(file, &code)
+  .expect("Unable to write to file");
+```
+## Additional details
+### Code indentation
+When (positive) code indentation is required, the processed block of code is indented by `indent`.
+```{#indent_prefix .rust}
+let prefix = format!("{:indent$}", "");
+```
+Each line is then `prefix`ed separately and the result is returned.
+```{#functions .rust}
+fn indent<'a>(
+  input: Cow<'a,str>,
+  indent: usize
+) -> Cow<'a,str> {
+  if indent > 0 {
+    <<indent_prefix>>
+    let size = input.len() + indent*input.lines().count();
+    let mut output = String::with_capacity(size);
+    input.lines().enumerate().for_each(|(i,line)| {
+      if i > 0 {
+        output.push('\n');
+      }
+      if !line.is_empty() {
+        output.push_str(&prefix);
+        output.push_str(line);
+      }
+    });
+    Cow::Owned(output)
+  } else {
+    input
+  }
+}
+```
+Note that, if no indentation is required (i.e., `indent` is equal to 0), no additional allocation is performed, and the `input` is returned as is.
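Note: the behaviour of `indent` on blank lines and on zero indentation can be observed with a standalone copy of the function. The sketch below follows the same logic as the blocks above, but names the width parameter `width` and passes it to `format!` explicitly; both are choices made for this illustration only.

```rust
use std::borrow::Cow;

/// Standalone copy of the indentation logic, for experimentation only.
fn indent<'a>(input: Cow<'a, str>, width: usize) -> Cow<'a, str> {
    if width == 0 {
        // Nothing to do: hand the input back without allocating.
        return input;
    }
    // A string made of `width` spaces.
    let prefix = format!("{:width$}", "", width = width);
    let size = input.len() + width * input.lines().count();
    let mut output = String::with_capacity(size);
    for (i, line) in input.lines().enumerate() {
        if i > 0 {
            output.push('\n');
        }
        // Empty lines stay empty, so no trailing whitespace is produced.
        if !line.is_empty() {
            output.push_str(&prefix);
            output.push_str(line);
        }
    }
    Cow::Owned(output)
}
```

In this sketch a borrowed input stays borrowed when `width` is 0, empty lines are left without a prefix, and a trailing newline in the input is dropped because `lines()` does not yield a final empty line.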
+### RegEx matching
+`pangler` uses the `regex` library to perform regular expression matching and substitution.
+Moreover, the library suggests the use of `lazy_static` to ensure that the regexes used are compiled exactly once per execution.
+```{#dependencies .toml}
+lazy_static = "1.4"
+regex = "1.5"
+```
+```{#uses .rust}
+use lazy_static::lazy_static;
+use regex::{Captures,Regex};
+```
+We wrap the regex definition in a `lazy_static` macro
+```{#regex_definition .rust}
+lazy_static! {
+  <<regex_path>>
+  <<regex_macro>>
+}
+```
+### Writing to file
+Writing to file is an operation performed using the Rust support for OS operations from the standard library.
+```{#uses .rust}
+use std::fs;
+use std::io::Result;
+use std::path::PathBuf;
+```
+First, all necessary parent directories of `path` are created
+```{#parent_directory_creation .rust}
+fs::create_dir_all(path.parent().unwrap())?;
+```
+and then the `content` is written to the file at `path`
+    
+```{#write_to_file .rust}
+fs::write(path, content)?;
+```
+We check `path` and write the content only if the path is relative; absolute paths are reported on stderr and skipped.
+```{#functions .rust}
+fn write_to_file(
+  path: PathBuf, content: &str
+) -> std::io::Result<()> {
+  if path.is_relative() {
+    <<parent_directory_creation>>
+    <<write_to_file>>
+  } else { 
+    eprintln!(
+      "Absolute paths not supported: {}",
+      path.to_string_lossy()
+    )
+  }
+  Ok(())
+}
+```
+# Credits
+`pangler v0.2.0` was created by Federico Igne (git@federicoigne.com) and is available at [`https://git.dyamon.me/projects/pangler`](https://git.dyamon.me/projects/pangler).
+```{#Cargo.toml .toml}
+[package]
+name = "pangler"
+version = "0.2.0"
+edition = "2021"
+[dependencies]
+<<dependencies>>
+```
diff --git a/util/weaver.lua b/util/weaver.lua
new file mode 100644
index 0000000..1159988
--- /dev/null
+++ b/util/weaver.lua
@@ -0,0 +1,34 @@
+if FORMAT:match 'latex' then
+  -- Setting custom `listings` style
+  function Meta(m)
+    m["header-includes"] = pandoc.MetaBlocks({pandoc.RawBlock("latex",[[
+        \lstdefinestyle{weaver}{
+            basicstyle=\small\ttfamily,
+            backgroundcolor=\color{gray!10},
+            xleftmargin=0.5cm,
+            numbers=left,
+            numbersep=5pt,
+            numberstyle=\tiny\color{gray},
+            captionpos=b
+        }
+        \lstset{style=weaver}
+    ]])})
+    return m
+  end
+  function CodeBlock(b)
+    -- Remove `path` attribute and merge it with `id`
+    if b.attributes.path and b.identifier then
+      b.identifier = b.attributes.path .. b.identifier
+      b.attributes.path = nil
+    end
+    -- Add ID to caption
+    if b.identifier then
+      if b.attributes.caption then
+        b.attributes.caption = b.identifier .. ": " .. b.attributes.caption
+      else
+        b.attributes.caption = b.identifier
+      end
+    end
+    return b
+  end
+end
diff --git a/README.md b/README.md new file mode 100644 index 0000000..28a9eed --- /dev/null +++ b/README.md
@@ -0,0 +1,578 @@
	1	---
	2	title: "Pangler: literate programming in Pandoc"
	3	author: Federico Igne
	4	date: \today
	5	---
	6
	7	This documents describes the logic and design of `pangler`, a minimal tangler for literate programming using the [Pandoc Markdown syntax](https://pandoc.org/MANUAL.html#pandocs-markdown).
	8
	9	[Literate Programming](https://en.wikipedia.org/wiki/Literate_programming) (LP) is a programming paradigm that emphasize the natural flow of thoughts that the programmer experiences when writing software.
	10	The paradigm can be seen as "documentation first" and the focus is on "human-to-human" communication.
	11	The produced document is a text-based prose document describing the logic and the design of the program, interspersed with snippets of code that form the final software.
	12
	13	Given an LP document, one can either extract the tangled code (with a "tangler") or generate its documentation, "woven" from the literate input source (with a "weaver").
	14
	15	In this case, [Pandoc](https://pandoc.org) is a very good weaver, supporting the generation of different document formats from a Markdown source.
	16	This document is an attempt at providing a tangler working alongside Pandoc.
	17	`pangler` is itself written in Pandoc Markdown format and can be generated from this document using itself.
	18
	19	# Literate programming with `pangler`
	20
	21	`pangler` uses two main features provided by the Pandoc Markdown syntax, which are not necessarily present in other Markdown flavours:
	22
	23	1. [`backtick_code_blocks`](https://pandoc.org/MANUAL.html#extension-backtick_code_blocks) for writing fenced snippets of code, and
	24	2. [`fenced_code_attributes`](https://pandoc.org/MANUAL.html#extension-fenced_code_attributes) for adding arbitrary HTML attributes, classes and ID to a snippet of code.
	25
	26	## Writing programs
	27
	28	In the following, we indicate a literate program as a markdown file written in Pandoc Markdown syntax.
	29
	30	A minimal block of code, recognized by `pangler`, with ID `identifier` is
	31
	32	~~~
	33	```{#identifier}
	34	[code snippet]
	35	```
	36	~~~
	37
	38	Code blocks can contain code macros of the form `<<identifier>>` where `identifier` is a valid code block ID.
	39	Code macros will be recursively substituted by the corresponding code snippet during [code generation][Tangling: generating the source files].
	40	A code macro needs to be placed in its own line, with an optional (whitespace) indentation, used during code generation to indent the code snippet.
	41
	42	Additional attributes and classes can be added to a code block, as well;
	43	the language of the code snippet can be provided and is useful to enable correct syntax highlighting.
	44
	45	~~~
	46	```{#identifier .python}
	47	[python code snippet]
	48	```
	49	~~~
	50
	51	An identifier can also be a file name matching the following regex
	52
	53	```{#regex_path .rust}
	54	static ref PATH: Regex =
	55	Regex::new(
	56	r"^(?:[[:word:]\.-]+/)*[[:word:]\.-]+\.[[:alpha:]]+$"
	57	).unwrap();
	58	```
	59
	60	In that case the code block is considered a valid entry point for the generation of a file with that name.
	61	The code block defines the content of the new file.
	62
	63	~~~
	64	```{#file.py .python}
	65	[python main file]
	66	```
	67	~~~
	68
	69	File names can be generated in subfolders using the `path` attribute.
	70	The following code block determines the content of file `path/to/file.py`.
	71
	72	~~~
	73	```{#file.py .python path="path/to/"}
	74	[python main file]
	75	```
	76	~~~
	77
	78	This path is relative to the current working directory, unless [the `-o`/`--output` flag is used][Command Line Interface].
	79
	80	Code blocks without an ID are ignored.
	81
	82	```{#code_block_gathering .rust}
	83	if !id.is_empty() {
	84	let key = {
	85	let path = attrs.iter().find(\|(k,_)\| k == "path");
	86	if let Some(path) = path {
	87	format!("{}{}", path.1, id)
	88	} else {
	89	id.to_string()
	90	}
	91	};
	92	<<code_block>>
	93	} else {
	94	eprintln!("Ignoring code block without ID:");
	95	eprintln!("{}", indent(Cow::from(code),4));
	96	}
	97	```
	98
	99	Code blocks are processed in order.
	100	By default, if an identifier is already defined, the code block is appended to the current corresponding value.
	101
	102	Use the `override` class in the code block definition to cause the block to override the previous entry with the same key, if this exists.
	103
	104	~~~
	105	```{#identifier .python .override}
	106	[Python code snippet]
	107	```
	108	~~~
	109
	110	This is handled in code as follows
	111
	112	```{#code_block .rust}
	113	if clss.iter().any(\|c\| c == "override") {
	114	blocks.insert(key, Cow::from(code));
	115	} else {
	116	blocks.entry(key)
	117	.and_modify(\|s\| {
	118	*s += "\n";
	119	*s += Cow::from(code)
	120	})
	121	.or_insert(Cow::from(code));
	122	}
	123	```
	124
	125	## Tangling: generating the source files
	126
	127	To bootstrap the tangling process, an early version of `pangler` is provided under `bin/` in this repository.
	128
	129	You can generate the code for the current version of the program, in the current working directory, with
	130
	131	```sh
	132	./bin/pangler-v0.1.0 README.md
	133	```
	134
	135	and compile it with
	136
	137	```sh
	138	cargo build --release
	139	```
	140
	141	From now on you can make changes to the `README.md` file and use the latest version of `pangler` to tangle and compile it.
	142
	143	## Weaving: generating the documentation
	144
	145	As explained above we use [`pandoc`](https://pandoc.org/) as a weaver.
	146	Run the following command to generate a PDF file for this document
	147
	148	```sh
	149	pandoc --to latex \
	150	--listings \
	151	--number-sections \
	152	--lua-filter=util/weaver.lua \
	153	--output pangler.pdf \
	154	README.md
	155	```
	156
	157	The Lua filter `util/weaver.lua` is provided to handle custom `pangler` attributes during the PDF generation via the \LaTeX\ engine.
	158
	159	## Integration with (Neo)Vim
	160
	161	(Neo)Vim supports code highlighting inside Markdown blocks, when the programming language is provided among its attributes.
	162	Add the following to your config file to enable code highlighting for a specific set of languages
	163
	164	```vimscript
	165	let g:markdown_fenced_languages =
	166	['python','rust','scala']
	167	```
	168
	169	# Command Line Interface
	170
	171	`pangler` offers a very simple command line interface.
	172	For an overview of the functionalities offered by the tool run
	173
	174	```sh
	175	pangler --help
	176	```
	177
	178	`pangler` uses the `clap` library to parse command line arguments
	179
	180	```{#dependencies .toml}
	181	clap = { version = "3.1", features = ["derive"] }
	182	```
	183
	184	```{#uses .rust}
	185	use clap::Parser;
	186	```
	187
	188	using the [Derive API](https://github.com/clap-rs/clap/blob/v3.1.18/examples/tutorial_derive/README.md) to define the exposed functionalities.
	189	The `struct` holding the CLI information is defined as follow
	190
	191	```{#config .rust}
	192	/// A tangler for Literate Programming in Pandoc
	193	#[derive(Parser, Debug)]
	194	#[clap(author, version, about, long_about = None)]
	195	struct Config {
	196	<<config_depth>>
	197	<<config_output>>
	198	<<config_input>>
	199	}
	200	```
	201
	202	and the arguments are parsed as
	203
	204	```{#config_parse .rust}
	205	let config = Config::parse();
	206	```
	207
	208	`pangler` accepts a sequence of files that will be parsed, code will be collected and used to build the final program.
	209	Note that the order of the file provided on the CLI is important when using the [overriding functionality][Writing programs].
	210
	211	```{#config_input .rust}
	212	/// Input files
	213	input: Vec<PathBuf>,
	214	```
	215
	216	By default, files are generated in the current working directory.
	217
	218	```{#constants .rust}
	219	const BASE: &str = "./";
	220	```
	221
	222	This behaviour can be overridden using the `-o`/`--output` flag.
	223
	224	```{#config_output .rust}
	225	/// Base output directory [default: './']
	226	#[clap(short, long)]
	227	output: Option<PathBuf>,
	228	```
	229
	230	Finally, recursive substitution of blocks can lead to an infinite loop.
	231	By default, `pangler` will stop after 10 substitution iterations, but this parameter can be changed with the `-d`/`--depth` flag.
	232
	233	```{#config_depth .rust}
	234	/// Maximum substitution depth
	235	#[clap(short, long, default_value_t = 10)]
	236	depth: u32,
	237	```
	238
	239	# The program
	240
	241	The program is structured as a single Rust file with the following being the main entry point of the program
	242
	243	```{#main.rs .rust path="src/"}
	244	<<uses>>
	245
	246	<<constants>>
	247
	248	<<config>>
	249
	250	<<types>>
	251
	252	<<functions>>
	253
	254	fn main() -> Result<()> {
	255	<<config_parse>>
	256	<<pandoc_setup>>
	257	Ok(())
	258	}
	259	```
	260
	261	## Pandoc
	262
	263	We are using [`rust-pandoc`](https://github.com/oli-obk/rust-pandoc) and [`pandoc-ast`](https://github.com/oli-obk/pandoc-ast) to interact with `pandoc` from Rust.
	264
	265	```{#dependencies .toml}
	266	pandoc = "0.8"
	267	pandoc_ast = "0.8"
	268	```
	269
	270	```{#uses .rust}
	271	use pandoc::{
	272	InputFormat,InputKind,OutputFormat,OutputKind,Pandoc
	273	};
	274	use pandoc_ast::Block;
	275	```
	276
	277	First we need to initialize a new `Pandoc` struct
	278
	279	```{#pandoc_setup .rust}
	280	let mut pandoc = Pandoc::new();
	281	```
	282
	283	and set up the input parameters.
	284	The input is a sequence of Markdown files passed as config options from the CLI.
	285
	286	```{#pandoc_setup .rust}
	287	pandoc.set_input(InputKind::Files(config.input));
	288	pandoc.set_input_format(InputFormat::Markdown, vec![]);
	289	```
	290
	291	The output is piped to stdout in JSON format.
	292
	293	```{#pandoc_setup .rust}
	294	pandoc.set_output(OutputKind::Pipe);
	295	pandoc.set_output_format(OutputFormat::Json, vec![]);
	296	```
	297
	298	In this way, we will be able to pipe the output into a Pandoc filter that will collect the code snippets and build the codebase for us.
	299
	300	```{#pandoc_setup .rust}
	301	pandoc.add_filter(
	302	move \|json\| pandoc_ast::filter(json,
	303	\|pandoc\| {
	304	<<pandoc_filter>>
	305	}
	306	)
	307	);
	308	pandoc.execute().unwrap();
	309	```
	310
	311	## Pandoc filters
	312
	313	Pandoc allows for the definition of [custom filters](https://pandoc.org/filters.html) to change the abstract syntax tree of a document.
	314
	315	In this case we use a filter to collect code snippets from the input Markdown text into a `HashMap`, mapping code block identifiers to code block snippets.
	316
	317	```{#uses .rust}
	318	use std::borrow::Cow;
	319	use std::collections::HashMap;
	320	```
	321
	322	```{#types .rust}
	323	type Blocks<'a> = HashMap<String,Cow<'a,str>>;
	324	```
	325
	326	Code blocks are wrapped into a [`Cow`](https://doc.rust-lang.org/stable/std/borrow/enum.Cow.html), i.e., a "copy-on-write" smart pointer, to avoid string duplication, unless strictly necessary.
	327
	328	We iterate over all code blocks, along with their IDs, classes and attributes, collecting them
	329
	330	```{#pandoc_filter .rust}
	331	let mut blocks: Blocks = HashMap::new();
	332	pandoc.blocks.iter().for_each(\|block\|
	333	if let Block::CodeBlock((id,clss,attrs), code) = block {
	334	<<code_block_gathering>>
	335	}
	336	);
	337	```
	338
	339	And then we build the source code, making sure to cut off recursive code generation with depth larger than `config.depth`.
	340
	341	```{#pandoc_filter .rust}
	342	build(&config.output, &blocks, config.depth);
	343	```
	344
	345	The filter returns the Pandoc JSON unchanged.
	346
	347	```{#pandoc_filter .rust}
	348	pandoc
	349	```
	350
	351	## Source code generation
	352
	353	In order to build the source code from the gathered code block snippets, we need to recursively substitute code macros of the form `<<identifier>>` with the corresponding code block.
	354
	355	Code macros are matched with the following regex
	356
	357	```{#regex_macro .rust}
	358	static ref MACRO: Regex =
	359	Regex::new(
	360	r"(?m)^([[:blank:]]*)<<([^>\s]+)>>"
	361	).unwrap();
	362	```
	363
	364	Note that, when matching the code macro, we keep track of its indentation as well, in order to properly indend code.
	365
	366	Given a code macro, the following closure will compute the substituting block of code, properly indented.
	367	The input `Captures` structure is a vector with the regex capture groups, i.e., indentation and macro identifier, along with the full match in the first position.
	368
	369	In case we reach the maximum allowed depth we truncate code block substitution and notify the user that something might not have been generated as expected.
	370
```{#macro_closure .rust}
|caps: &Captures| {
    if current_depth < max_depth {
        let block = blocks
            .get(&caps[2])
            .expect("Block not present")
            .clone();
        indent(block, caps[1].len())
    } else {
        eprintln!("Reached maximum depth, \
                   output might be truncated.\n\
                   Increase `--depth` accordingly.");
        Cow::Owned(String::from(""))
    }
}
```
	387
	388	As explained above, the building process iterates over all collected blocks and detects relevant entry points (files to generate) to start the recursive macro substitution.
	389
```{#functions .rust}
fn build(
    base: &Option<PathBuf>,
    blocks: &Blocks,
    max_depth: u32
) {
    <<regex_definition>>
    blocks
        .iter()
        .for_each(|(path,code)| if PATH.is_match(path) {
            <<code_generation>>
        })
}

```
	405
	406	### Recursive macro substitution
	407
The code generation algorithm went through multiple iterations, which revealed some interesting details of working with `Cow`s. A first version looked as follows
	409
	410	```{#code_generation .rust}
	411	let mut current_depth = 0;
	412	let mut code = code.clone();
	413	while MACRO.is_match(&code) {
	414	code = MACRO.replace_all(
	415	&code,
	416	<<macro_closure>>
	417	);
	418	current_depth += 1;
	419	}
	420	```
	421
The problem with this version is that, due to how `Cow` works, the value returned by `replace_all` cannot outlive the borrow of `code` passed as a parameter.
This is because the function returns a reference to `code` (`Cow::Borrowed`) if no replacement takes place, so for the returned value to be valid, `code` still needs to be available.
But here, `code` gets overwritten right away, so, in principle, if no replacement takes place `code` would be overwritten by a reference to itself (losing data).

However, note that this doesn't happen in practice (but the compiler doesn't know about this) because `replace_all` is applied only as long as some replacement is possible (the `while` condition).
In other words, every call to `replace_all` returns a `Cow::Owned` value.
	429
	430	The problem is solved by a clever use of pattern matching
	431
	432	```{#code_generation .rust .override}
	433	let mut current_depth = 0;
	434	let mut code = code.clone();
	435	while let Cow::Owned(new_code) = MACRO.replace_all(
	436	&code,
	437	<<macro_closure>>
	438	) {
	439	code = Cow::from(new_code);
	440	current_depth += 1;
	441	}
	442	```
	443
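In this case, the `String` bound by the `Cow::Owned` pattern is an owned value, so it is not tied to the lifetime of the borrowed `code` (the lifetime in the returned `Cow<'_,str>` only concerns the `Cow::Borrowed` variant).
Moreover, `code` takes ownership of `new_code: String` using the `Cow::from()` function.
No heap allocation is performed, and the string is not copied.
The loop stops as soon as `replace_all` returns a `Cow::Borrowed`, i.e., when no macro is left to substitute.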
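
The same loop shape can be tried without the `regex` crate. The sketch below is an illustration under simplifying assumptions, not `pangler`'s implementation: `macro_id`, `expand_once` and `expand` are hypothetical names, a macro must occupy a whole line, and indentation is ignored. `expand_once` plays the role of `replace_all`, returning `Cow::Borrowed` when there is nothing to substitute.

```rust
use std::borrow::Cow;
use std::collections::HashMap;

/// Hypothetical: returns the identifier if the whole line is `<<id>>`.
fn macro_id(line: &str) -> Option<&str> {
    line.trim()
        .strip_prefix("<<")?
        .strip_suffix(">>")
        .filter(|id| !id.is_empty() && !id.contains(|c: char| c == '>' || c.is_whitespace()))
}

/// One substitution pass. With `truncate` set, macros are replaced by
/// an empty string instead of the corresponding block.
fn expand_once<'a>(code: &'a str, blocks: &HashMap<&str, &str>, truncate: bool) -> Cow<'a, str> {
    if !code.lines().any(|line| macro_id(line).is_some()) {
        // No macro left: hand the input back without copying it.
        return Cow::Borrowed(code);
    }
    let expanded: Vec<&str> = code
        .lines()
        .map(|line| match macro_id(line) {
            Some(_) if truncate => "",
            Some(id) => blocks.get(id).copied().expect("Block not present"),
            None => line,
        })
        .collect();
    Cow::Owned(expanded.join("\n"))
}

/// Repeats the pass until it reports `Cow::Borrowed`.
fn expand(entry: &str, blocks: &HashMap<&str, &str>, max_depth: u32) -> String {
    let mut depth = 0;
    let mut code: Cow<str> = Cow::Borrowed(entry);
    // `new_code` is an owned `String`, so reassigning `code` is accepted.
    while let Cow::Owned(new_code) = expand_once(&code, blocks, depth >= max_depth) {
        code = Cow::from(new_code);
        depth += 1;
    }
    code.into_owned()
}
```

Once `depth` reaches `max_depth`, the next pass replaces any remaining macro with an empty string, and the pass after that finds nothing to do, which ends the loop.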
	447
	448	Finally, we write the code to file
	449
	450	```{#code_generation .rust}
	451	let file = base
	452	.clone()
	453	.unwrap_or(PathBuf::from(BASE))
	454	.join(path);
	455	write_to_file(file, &code)
	456	.expect("Unable to write to file");
	457	```
	458
	459	## Additional details
	460
	461	### Code indentation
	462
When (positive) code indentation is required, the processed block of code is indented by `indent` spaces, using a prefix that is built once for the whole block.
	464
	465	```{#indent_prefix .rust}
	466	let prefix = format!("{:indent$}", "");
	467	```
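
The format string above relies on the width parameter of Rust's formatting syntax: an empty string padded to a width of `indent` columns is a string of `indent` spaces. A small standalone check, where `make_prefix` is a hypothetical wrapper used only for illustration:

```rust
/// Hypothetical wrapper around the prefix computation.
fn make_prefix(indent: usize) -> String {
    // `{:indent$}` pads the (empty) argument with spaces up to `indent` columns.
    format!("{:indent$}", "")
}
```

With `indent` equal to 0 the result is the empty string.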
	468
	469	Each line is then `prefix`ed separately and the result is returned.
	470
```{#functions .rust}
fn indent<'a>(
    input: Cow<'a,str>,
    indent: usize
) -> Cow<'a,str> {
    if indent > 0 {
        <<indent_prefix>>
        let size = input.len() + indent*input.lines().count();
        let mut output = String::with_capacity(size);
        input.lines().enumerate().for_each(|(i,line)| {
            if i > 0 {
                output.push('\n');
            }
            if !line.is_empty() {
                output.push_str(&prefix);
                output.push_str(line);
            }
        });
        Cow::Owned(output)
    } else {
        input
    }
}

```
	496
	497	Note that, if no indentation is required (i.e., `indent` is equal to 0), no additional allocation is performed, and the `input` is returned as is.
	498
	499	### RegEx matching
	500
	501	`pangler` uses the `regex` library to perform regular expression matching and substitution.
Moreover, the `regex` documentation suggests using `lazy_static` to ensure that each regex is compiled exactly once per execution.
	503
	504
	505	```{#dependencies .toml}
	506	lazy_static = "1.4"
	507	regex = "1.5"
	508	```
	509
	510	```{#uses .rust}
	511	use lazy_static::lazy_static;
	512	use regex::{Captures,Regex};
	513	```
	514
We wrap the regex definitions in a `lazy_static!` macro invocation
	516
	517	```{#regex_definition .rust}
	518	lazy_static! {
	519	<<regex_path>>
	520	<<regex_macro>>
	521	}
	522	```
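
For comparison only, the "initialised exactly once" behaviour can be observed with the standard library's `OnceLock` (available in recent Rust releases). This is a sketch of the idea, not what `pangler` does: `pangler` relies on `lazy_static`, and the `macro_delimiters` function and `INIT_COUNT` counter are hypothetical names introduced for the example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how many times the initialiser below actually runs.
static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical lazily initialised value standing in for a compiled regex.
fn macro_delimiters() -> &'static (String, String) {
    static DELIMS: OnceLock<(String, String)> = OnceLock::new();
    DELIMS.get_or_init(|| {
        // Runs on the first call only; later calls reuse the stored value.
        INIT_COUNT.fetch_add(1, Ordering::SeqCst);
        (String::from("<<"), String::from(">>"))
    })
}
```

Calling `macro_delimiters` repeatedly returns the same reference and leaves the counter at 1.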
	523
	524	### Writing to file
	525
Files are written using the filesystem support provided by the Rust standard library (`std::fs`).
	527
	528	```{#uses .rust}
	529	use std::fs;
	530	use std::io::Result;
	531	use std::path::PathBuf;
	532	```
	533
	534	First, all necessary parent directories of `path` are created
	535
	536	```{#parent_directory_creation .rust}
	537	fs::create_dir_all(path.parent().unwrap())?;
	538	```
	539
and then the `content` is written to the file identified by `path`
	541
	542	```{#write_to_file .rust}
	543	fs::write(path, content)?;
	544	```
	545
We perform a check on `path` and only write the content to the file if the path is relative (and therefore resolved against the current working directory); absolute paths are reported on standard error and skipped.
	547
	548	```{#functions .rust}
	549	fn write_to_file(
	550	path: PathBuf, content: &str
	551	) -> std::io::Result<()> {
	552	if path.is_relative() {
	553	<<parent_directory_creation>>
	554	<<write_to_file>>
	555	} else {
	556	eprintln!(
	557	"Absolute paths not supported: {}",
	558	path.to_string_lossy()
	559	)
	560	}
	561	Ok(())
	562	}
	563
	564	```
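
The two `Path` methods this function relies on can be explored in isolation. In the sketch below, `dir_to_create` is a hypothetical helper that mirrors the check: it returns the directory that would have to exist before writing, or `None` for an absolute path that would be skipped.

```rust
use std::path::Path;

/// Hypothetical helper mirroring the check in `write_to_file`.
fn dir_to_create(path: &Path) -> Option<&Path> {
    if path.is_relative() {
        // For a bare file name the parent is the empty path.
        path.parent()
    } else {
        // Absolute paths are not written at all.
        None
    }
}
```

Note that a path containing `..` components is still relative, so `is_relative` alone does not confine the output to the current working directory.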
	565
	566	# Credits
	567
`pangler v0.2.0` was created by Federico Igne (git@federicoigne.com) and is available at [`https://git.dyamon.me/projects/pangler`](https://git.dyamon.me/projects/pangler).
	569
	570	```{#Cargo.toml .toml}
	571	[package]
	572	name = "pangler"
	573	version = "0.2.0"
	574	edition = "2021"
	575
	576	[dependencies]
	577	<<dependencies>>
	578	```


diff --git a/util/weaver.lua b/util/weaver.lua
new file mode 100644
index 0000000..1159988
--- /dev/null
+++ b/util/weaver.lua
@@ -0,0 +1,34 @@
if FORMAT:match 'latex' then
    -- Setting custom `listings` style
    function Meta(m)
        m["header-includes"] = pandoc.MetaBlocks({pandoc.RawBlock("latex",[[
            \lstdefinestyle{weaver}{
                basicstyle=\small\ttfamily,
                backgroundcolor=\color{gray!10},
                xleftmargin=0.5cm,
                numbers=left,
                numbersep=5pt,
                numberstyle=\tiny\color{gray},
                captionpos=b
            }
            \lstset{style=weaver}
        ]])})
        return m
    end
    function CodeBlock(b)
        -- Remove `path` attribute and merge it with `id`
        if b.attributes.path and b.identifier then
            b.identifier = b.attributes.path .. b.identifier
            b.attributes.path = nil
        end
        -- Add ID to caption.
        -- A missing identifier is the empty string, which is truthy in Lua,
        -- so compare against "" explicitly.
        if b.identifier ~= "" then
            if b.attributes.caption then
                b.attributes.caption = b.identifier .. ": " .. b.attributes.caption
            else
                b.attributes.caption = b.identifier
            end
        end
        return b
    end
end