---
title: "Pangler: literate programming in Pandoc"
author: Federico Igne
date: \today
...

This document describes the logic and design of `pangler`, a minimal tangler for literate programming using the [Pandoc Markdown syntax](https://pandoc.org/MANUAL.html#pandocs-markdown).

[Literate Programming](https://en.wikipedia.org/wiki/Literate_programming) (LP) is a programming paradigm that emphasizes the natural flow of thoughts that the programmer experiences when writing software.
The paradigm can be seen as "documentation first" and the focus is on "human-to-human" communication.
The produced document is a text-based prose document describing the logic and the design of the program, interspersed with snippets of code that form the final software.

Given an LP document, one can either extract the *tangled* code (with a "tangler") or generate its documentation, "woven" from the literate input source (with a "weaver").

In this case, [Pandoc](https://pandoc.org) is a very good weaver, supporting the generation of different document formats from a Markdown source.
This document is an attempt at providing a tangler working alongside Pandoc.
`pangler` is itself written in Pandoc Markdown format and can be generated from this document using itself.

# Literate programming with `pangler`

`pangler` uses two main features provided by the Pandoc Markdown syntax, which are not necessarily present in other Markdown flavours:

1. [`backtick_code_blocks`](https://pandoc.org/MANUAL.html#extension-backtick_code_blocks) for writing fenced snippets of code, and
2. [`fenced_code_attributes`](https://pandoc.org/MANUAL.html#extension-fenced_code_attributes) for adding arbitrary HTML attributes, classes and IDs to a snippet of code.

## Writing programs

In the following, a *literate program* is a Markdown file written in Pandoc Markdown syntax.

A minimal block of code, recognized by `pangler`, with ID `identifier` is

~~~
```{#identifier}
[code snippet]
```
~~~

Code blocks can contain **code macros** of the form `<<identifier>>` where `identifier` is a valid code block ID.
Code macros will be recursively substituted by the corresponding code snippet during [code generation](#tangling-generating-the-source-files).
A code macro needs to be placed on its own line, optionally preceded by whitespace; this indentation is reused during code generation to indent the substituted code snippet.

Additional attributes and classes can be added to a code block as well;
for example, providing the language of the code snippet enables correct syntax highlighting.

~~~
```{#identifier .python}
[python code snippet]
```
~~~

An identifier can also be a file name matching the following regex

```{#regex_path .rust}
static ref PATH: Regex =
  Regex::new(
    r"^(?:[[:word:]\.-]+/)*[[:word:]\.-]+\.[[:alpha:]]+$"
  ).unwrap();
```

In that case the code block is considered a valid **entry point** for the generation of a file with that name.
The code block defines the content of the new file.

~~~
```{#file.py .python}
[python main file]
```
~~~
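
To make the entry point rule concrete, here is a minimal sketch that approximates the path regex above using only the standard library. It assumes the ASCII meaning of `[[:word:]]` and `[[:alpha:]]`; the helper name `is_entry_point` is hypothetical and not part of `pangler`.

```rust
/// Hypothetical helper approximating the entry point regex:
/// zero or more `dir/` segments followed by `name.ext`,
/// where `ext` consists of ASCII letters only.
fn is_entry_point(key: &str) -> bool {
    // Characters accepted by `[[:word:]\.-]` (ASCII assumption).
    let allowed = |c: char| c.is_ascii_alphanumeric() || c == '_' || c == '.' || c == '-';
    let mut segments: Vec<&str> = key.split('/').collect();
    let file = match segments.pop() {
        Some(f) => f,
        None => return false,
    };
    // Every directory segment must be non-empty and use allowed characters.
    if !segments.iter().all(|s| !s.is_empty() && s.chars().all(allowed)) {
        return false;
    }
    // The file name needs a non-empty stem and an alphabetic extension.
    match file.rsplit_once('.') {
        Some((stem, ext)) => {
            !stem.is_empty()
                && stem.chars().all(allowed)
                && !ext.is_empty()
                && ext.chars().all(|c| c.is_ascii_alphabetic())
        }
        None => false,
    }
}
```

Under this reading, `main.rs`, `src/main.rs` and `Cargo.toml` are entry points, while a plain identifier such as `code_block` is not.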

File names can be generated in subfolders using the `path` attribute.
The following code block determines the content of file `path/to/file.py`.

~~~
```{#file.py .python path="path/to/"}
[python main file]
```
~~~

This path is relative to the current working directory, unless [the `-o`/`--output` flag is used](#command-line-interface).

Code blocks without an ID are ignored.

```{#code_block_gathering .rust}
if !id.is_empty() {
  let key = {
    let path = attrs.iter().find(|(k,_)| k == "path");
    if let Some(path) = path {
      format!("{}{}", path.1, id) 
    } else {
      id.to_string()
    }
  };
  <<code_block>>
} else {
  eprintln!("Ignoring code block without ID:");
  eprintln!("{}", indent(Cow::from(code),4));
}
```
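
The key computation above is a plain string concatenation of the `path` attribute and the ID, so the trailing `/` in `path` matters. The following sketch isolates that step; `block_key` is a hypothetical name used only for illustration.

```rust
/// Hypothetical helper mirroring the key computation:
/// the `path` attribute, when present, is prepended verbatim to the ID.
fn block_key(id: &str, path: Option<&str>) -> String {
    match path {
        Some(p) => format!("{}{}", p, id),
        None => id.to_string(),
    }
}
```

With `path="path/to/"` and ID `file.py` the key is `path/to/file.py`; without the trailing slash it would be `path/tofile.py`.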

Code blocks are processed in order.
By default, if an identifier is already defined, the code block is appended to the current corresponding value.

Use the `override` class in the code block definition to cause the block to override the previous entry with the same key, if this exists.

~~~
```{#identifier .python .override}
[Python code snippet]
```
~~~

This is handled in code as follows

```{#code_block .rust}
if clss.iter().any(|c| c == "override") {
  blocks.insert(key, Cow::from(code));
} else {
  blocks.entry(key)
        .and_modify(|s| {
          *s += "\n";
          *s += Cow::from(code)
        })
        .or_insert(Cow::from(code));
}
```
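
The append and override behaviour can be tried in isolation with the sketch below. It uses owned `String` values instead of `Cow` to stay short, and the function name `add_block` and its `replace` flag are assumptions made for this example.

```rust
use std::collections::HashMap;

/// Hypothetical helper: append `code` to the block stored under `key`,
/// or replace the stored block when `replace` is true.
fn add_block(
    blocks: &mut HashMap<String, String>,
    key: &str,
    code: &str,
    replace: bool,
) {
    if replace {
        blocks.insert(key.to_string(), code.to_string());
    } else {
        blocks
            .entry(key.to_string())
            .and_modify(|s| {
                // Blocks sharing an ID are joined by a newline.
                s.push('\n');
                s.push_str(code);
            })
            .or_insert_with(|| code.to_string());
    }
}
```

Adding `use a;` and then `use b;` under the same key yields `use a;\nuse b;`; a later call with `replace` set discards both and keeps only the new snippet.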

## Tangling: generating the source files

To bootstrap the tangling process, a tangled version of the program is provided alongside the literate version.

The executable can be compiled from the root of the project with

```sh
cargo build --release
```

From now on you can make changes to the `README.md` file and use your latest compiled version of `pangler` to tangle and compile it.

## Weaving: generating the documentation

As explained above, we use [`pandoc`](https://pandoc.org/) as a weaver.
Run the following command to generate a PDF file for this document

```sh
pandoc --to latex \
       --listings \
       --number-sections \
       --lua-filter=util/weaver.lua \
       --output pangler.pdf \
       README.md
```

The Lua filter `util/weaver.lua` is provided to handle custom `pangler` attributes during the PDF generation via the \LaTeX\ engine.

## Integration with (Neo)Vim

(Neo)Vim supports code highlighting inside Markdown blocks, when the programming language is provided among its attributes.
Add the following to your config file to enable code highlighting for a specific set of languages

```vimscript
let g:markdown_fenced_languages =
  ['python','rust','scala']
```

# Command Line Interface

`pangler` offers a very simple command line interface.
For an overview of the functionalities offered by the tool run

```sh
pangler --help
```

`pangler` uses the `clap` library to parse command line arguments

```{#dependencies .toml}
clap = { version = "3.1", features = ["derive"] }
```

```{#uses .rust}
use clap::Parser;
```

using the [Derive API](https://github.com/clap-rs/clap/blob/v3.1.18/examples/tutorial_derive/README.md) to define the exposed functionalities.
The `struct` holding the CLI information is defined as follows

```{#config .rust}
/// A tangler for Literate Programming in Pandoc
#[derive(Parser, Debug)]
#[clap(author, version, about, long_about = None)]
struct Config {
  <<config_depth>>
  <<config_output>>
  <<config_input>>
}
```

and the arguments are parsed as

```{#config_parse .rust}
let config = Config::parse();
```

`pangler` accepts a sequence of files; each file is parsed, and the code blocks collected from them are used to build the final program.
Note that the order of the files provided on the CLI is important when using the [overriding functionality](#writing-programs).

```{#config_input .rust}
/// Input files
input: Vec<PathBuf>,
```

By default, files are generated in the current working directory.

```{#constants .rust}
const BASE: &str = "./";
```

This behaviour can be overridden using the `-o`/`--output` flag.

```{#config_output .rust}
/// Base output directory [default: './']
#[clap(short, long)]
output: Option<PathBuf>,
```

Finally, recursive substitution of blocks can lead to an infinite loop.
By default, `pangler` will stop after 10 substitution iterations, but this parameter can be changed with the `-d`/`--depth` flag.

```{#config_depth .rust}
/// Maximum substitution depth
#[clap(short, long, default_value_t = 10)]
depth: u32,
```

# The program

The program is structured as a single Rust file with the following being the main entry point of the program

```{#main.rs .rust path="src/"}
<<uses>>

<<constants>>

<<config>>

<<types>>

<<functions>>

fn main() -> Result<()> {
  <<config_parse>>
  <<pandoc_setup>>
  Ok(())
}
```

## Pandoc

We are using [`rust-pandoc`](https://github.com/oli-obk/rust-pandoc) and [`pandoc-ast`](https://github.com/oli-obk/pandoc-ast) to interact with `pandoc` from Rust.

```{#dependencies .toml}
pandoc = "0.8"
pandoc_ast = "0.8"
```

```{#uses .rust}
use pandoc::{
  InputFormat,InputKind,OutputFormat,OutputKind,Pandoc
};
use pandoc_ast::Block;
```

First we need to initialize a new `Pandoc` struct

```{#pandoc_setup .rust}
let mut pandoc = Pandoc::new();
```

and set up the input parameters.
The input is a sequence of Markdown files passed as config options from the CLI.

```{#pandoc_setup .rust}
pandoc.set_input(InputKind::Files(config.input));
pandoc.set_input_format(InputFormat::Markdown, vec![]);
```

The output is piped to stdout in JSON format.

```{#pandoc_setup .rust}
pandoc.set_output(OutputKind::Pipe);
pandoc.set_output_format(OutputFormat::Json, vec![]);
```

In this way, we will be able to pipe the output into a Pandoc filter that will collect the code snippets and build the codebase for us.

```{#pandoc_setup .rust}
pandoc.add_filter(
  move |json| pandoc_ast::filter(json,
    |pandoc| {
      <<pandoc_filter>>
    }
  )
);
pandoc.execute().unwrap();
```

## Pandoc filters

Pandoc allows for the definition of [custom filters](https://pandoc.org/filters.html) to change the abstract syntax tree of a document.

In this case we use a filter to collect code snippets from the input Markdown text into a `HashMap`, mapping code block identifiers to code block snippets.

```{#uses .rust}
use std::borrow::Cow;
use std::collections::HashMap;
```

```{#types .rust}
type Blocks<'a> = HashMap<String,Cow<'a,str>>;
```

Code blocks are wrapped into a [`Cow`](https://doc.rust-lang.org/stable/std/borrow/enum.Cow.html), i.e., a "copy-on-write" smart pointer, to avoid string duplication, unless strictly necessary.

We iterate over all code blocks, along with their IDs, classes and attributes, collecting them

```{#pandoc_filter .rust}
let mut blocks: Blocks = HashMap::new();
pandoc.blocks.iter().for_each(|block|
  if let Block::CodeBlock((id,clss,attrs), code) = block {
    <<code_block_gathering>>
  }
);
```

And then we build the source code, making sure to cut off recursive code generation with depth larger than `config.depth`.

```{#pandoc_filter .rust}
build(&config.output, &blocks, config.depth);
```

The filter returns the Pandoc JSON unchanged.

```{#pandoc_filter .rust}
pandoc
```

## Source code generation

In order to build the source code from the gathered code block snippets, we need to recursively substitute *code macros* of the form `<<identifier>>` with the corresponding code block.

Code macros are matched with the following regex

```{#regex_macro .rust}
static ref MACRO: Regex =
  Regex::new(
    r"(?m)^([[:blank:]]*)<<([^>\s]+)>>"
  ).unwrap();
```

Note that, when matching the code macro, we keep track of its indentation as well, in order to properly indent code.
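
As an illustration of what the two capture groups hold, the sketch below extracts the same information from a single line without the `regex` crate. It is a simplified approximation: `parse_macro` is a hypothetical helper, and, like the regex, it does not look at anything following the closing `>>`.

```rust
/// Hypothetical helper: if `line` starts with optional blanks followed by
/// `<<id>>`, return the indentation width (in bytes) and the macro ID.
fn parse_macro(line: &str) -> Option<(usize, &str)> {
    // `[[:blank:]]` matches spaces and tabs only.
    let rest = line.trim_start_matches(|c: char| c == ' ' || c == '\t');
    let indent = line.len() - rest.len();
    let body = rest.strip_prefix("<<")?;
    let end = body.find(">>")?;
    let id = &body[..end];
    // The ID must be non-empty and contain neither `>` nor whitespace.
    if id.is_empty() || id.chars().any(|c| c.is_whitespace() || c == '>') {
        return None;
    }
    Some((indent, id))
}
```

For the line `    <<uses>>` this returns an indentation of 4 and the ID `uses`; a line such as `let x = 1;` yields `None`.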

Given a code macro, the following closure will compute the substituting block of code, properly indented.
The input `Captures` structure is a vector with the regex capture groups, i.e., indentation and macro identifier, along with the full match in the first position.

In case we reach the maximum allowed depth we truncate code block substitution and notify the user that something might not have been generated as expected.

```{#macro_closure .rust}
|caps: &Captures| {
  if current_depth < max_depth {
    let block = blocks
      .get(&caps[2])
      .expect("Block not present")
      .clone();
    indent(block, caps[1].len())
  } else {
    eprintln!("Reached maximum depth, \
               output might be truncated.\n\
               Increase `--depth` accordingly.");
    Cow::Owned(String::from(""))
  }
}
```
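
The effect of the depth cutoff is easiest to see on a block that refers to itself. The following sketch is a simplified, standard-library-only rendition of the substitution passes, assuming macros occupy a whole line; the function `expand` and its signature are hypothetical and do not appear in `pangler`.

```rust
use std::collections::HashMap;

/// Hypothetical sketch: repeatedly substitute whole-line `<<id>>` macros,
/// replacing them with nothing once `max_depth` passes have been made.
fn expand(blocks: &HashMap<&str, &str>, entry: &str, max_depth: u32) -> String {
    let mut code = blocks[entry].to_string();
    let mut depth = 0;
    loop {
        let mut changed = false;
        let pass: Vec<String> = code
            .lines()
            .map(|line| {
                let trimmed = line.trim_start();
                let id = trimmed
                    .strip_prefix("<<")
                    .and_then(|rest| rest.strip_suffix(">>"));
                match id {
                    Some(id) => {
                        changed = true;
                        if depth < max_depth {
                            // Reuse the macro's indentation for every line.
                            let prefix = " ".repeat(line.len() - trimmed.len());
                            blocks
                                .get(id)
                                .expect("Block not present")
                                .lines()
                                .map(|l| format!("{}{}", prefix, l))
                                .collect::<Vec<_>>()
                                .join("\n")
                        } else {
                            // Depth exceeded: truncate the substitution.
                            String::new()
                        }
                    }
                    None => line.to_string(),
                }
            })
            .collect();
        if !changed {
            return code;
        }
        code = pass.join("\n");
        depth += 1;
    }
}
```

A block whose content is just a macro pointing to itself is expanded `max_depth` times and then replaced by an empty string, so the loop terminates instead of running forever.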

As explained above, the building process iterates over all collected blocks and detects relevant entry points (files to generate) to start the recursive macro substitution.

```{#functions .rust}
fn build(
  base: &Option<PathBuf>,
  blocks: &Blocks,
  max_depth: u32
) {
  <<regex_definition>>
  blocks
    .iter()
    .for_each(|(path,code)| if PATH.is_match(path) { 
      <<code_generation>>
    })
}

```

### Recursive macro substitution

The code generating algorithm went through multiple iterations and showed some interesting details of using `Cow`s.

```{#code_generation .rust}
let mut current_depth = 0;
let mut code = code.clone();
while MACRO.is_match(&code) {
  code = MACRO.replace_all(
    &code,
    <<macro_closure>>
  );
  current_depth += 1;
}
```

The problem with this version is that, due to how `Cow` works, the value returned by `replace_all` cannot live longer than the borrowed `code` passed as a parameter.
This is because the function returns a reference to `code` (`Cow::Borrowed`) if no replacement takes place, so for the returned value to be valid, `code` still needs to be available.
But here, `code` gets overridden right away, so, in principle, if no replacement takes place `code` gets overridden by a reference to itself (losing data).

However, note that this doesn't happen in practice (but the compiler doesn't know about this) because the `replace_all` function is applied as long as some replacement is possible (`while` condition).
In other words, all calls to `replace_all` always return a `Cow::Owned` value.

The problem is solved by a clever use of pattern matching

```{#code_generation .rust .override}
let mut current_depth = 0;
let mut code = code.clone();
while let Cow::Owned(new_code) = MACRO.replace_all(
  &code,
  <<macro_closure>>
) {
  code = Cow::from(new_code);
  current_depth += 1;
}
```

In this case, the `String` bound by the `Cow::Owned` pattern is not tied to the lifetime of the borrowed value `code` (the returned type is `Cow<'_,str>`, but only the `Borrowed` variant actually holds a reference).
Moreover, `code` takes ownership of `new_code: String` using the `Cow::from()` function.
No heap allocation is performed, and the string is not copied.
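
The same pattern can be reproduced without the `regex` crate. In the sketch below, `replace_all` is a hypothetical stand-in written for this example (it is not the `regex` API) that borrows its input when nothing is replaced, and `expand` loops with `while let` exactly as long as an owned result comes back.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for a replace function returning a `Cow`:
/// the input is borrowed when there is nothing to replace.
fn replace_all<'a>(text: &'a str, from: &str, to: &str) -> Cow<'a, str> {
    if text.contains(from) {
        Cow::Owned(text.replace(from, to))
    } else {
        Cow::Borrowed(text)
    }
}

/// Keep substituting `<<x>>` with `y` until a pass changes nothing.
fn expand(start: &str) -> String {
    let mut code: Cow<str> = Cow::Borrowed(start);
    // Only the owned `String` is bound here, so `code` can be reassigned.
    while let Cow::Owned(new_code) = replace_all(&code, "<<x>>", "y") {
        code = Cow::from(new_code);
    }
    code.into_owned()
}
```

The loop ends on the first pass that returns `Cow::Borrowed`, which is also the signal that no macro is left to substitute.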

Finally, we write the code to file

```{#code_generation .rust}
let file = base
  .clone()
  .unwrap_or(PathBuf::from(BASE))
  .join(path);
write_to_file(file, &code)
  .expect("Unable to write to file");
```

## Additional details

### Code indentation

When (positive) code indentation is required, every non-empty line of the processed block of code is indented by `indent` spaces.

```{#indent_prefix .rust}
let prefix = format!("{:indent$}", "");
```

Each line is then `prefix`ed separately and the result is returned.

```{#functions .rust}
fn indent<'a>(
  input: Cow<'a,str>,
  indent: usize
) -> Cow<'a,str> {
  if indent > 0 {
    <<indent_prefix>>
    let size = input.len() + indent*input.lines().count();
    let mut output = String::with_capacity(size);
    input.lines().enumerate().for_each(|(i,line)| {
      if i > 0 {
        output.push('\n');
      }
      if !line.is_empty() {
        output.push_str(&prefix);
        output.push_str(line);
      }
    });
    Cow::Owned(output)
  } else {
    input
  }
}

```

Note that, if no indentation is required (i.e., `indent` is equal to 0), no additional allocation is performed, and the `input` is returned as is.
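
Two details of the positive-indentation case are worth spelling out: empty lines receive no prefix, and, because the input is split with `lines()`, a trailing newline is not preserved. The sketch below is a simplified `String`-based variant, named `indent_lines` here to avoid confusion with the function above, that makes both visible.

```rust
/// Hypothetical simplified variant of the indentation step:
/// prefix every non-empty line with `indent` spaces.
fn indent_lines(input: &str, indent: usize) -> String {
    let prefix = " ".repeat(indent);
    input
        .lines()
        .map(|line| {
            if line.is_empty() {
                // Empty lines stay empty (no trailing whitespace).
                String::new()
            } else {
                format!("{}{}", prefix, line)
            }
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Unlike the `Cow`-based function, this variant always allocates, even when `indent` is 0.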

### RegEx matching

`pangler` uses the `regex` library to perform regular expression matching and substitution.
Moreover, the library suggests the use of `lazy_static` to ensure that the regexes used are compiled exactly once per execution.


```{#dependencies .toml}
lazy_static = "1.4"
regex = "1.5"
```

```{#uses .rust}
use lazy_static::lazy_static;
use regex::{Captures,Regex};
```

We wrap the regex definition in a `lazy_static` macro

```{#regex_definition .rust}
lazy_static! {
  <<regex_path>>
  <<regex_macro>>
}
```

### Writing to file

Writing to file is an operation performed using the Rust support for OS operations from the standard library.

```{#uses .rust}
use std::fs;
use std::io::Result;
use std::path::PathBuf;
```

First, all necessary parent directories of `path` are created

```{#parent_directory_creation .rust}
fs::create_dir_all(path.parent().unwrap())?;
```

and then the `content` is written to the file at `path`

```{#write_to_file .rust}
fs::write(path, content)?;
```

We perform a check on `path` and only write the content to the file if the path is relative; absolute paths are reported on standard error and skipped.

```{#functions .rust}
fn write_to_file(
  path: PathBuf, content: &str
) -> std::io::Result<()> {
  if path.is_relative() {
    <<parent_directory_creation>>
    <<write_to_file>>
  } else { 
    eprintln!(
      "Absolute paths not supported: {}",
      path.to_string_lossy()
    )
  }
  Ok(())
}

```
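
One consequence of this check is easy to overlook. The target path is obtained by joining the base directory with the block key, so, as far as the code above suggests, an absolute `--output` directory produces an absolute target that `write_to_file` then refuses. The sketch below reproduces only the path computation; `target_path` is a hypothetical helper.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper mirroring how the target file path is built:
/// the base directory (default `./`) joined with the block key.
fn target_path(base: Option<&Path>, key: &str) -> PathBuf {
    base.unwrap_or(Path::new("./")).join(key)
}
```

With no base, the key `src/main.rs` resolves to `./src/main.rs`, which is relative. On Unix-like systems a base such as `/tmp/out` gives an absolute result, so a relative output directory is the safer choice.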

# Credits

`pangler` was created by Federico Igne (git@federicoigne.com) and is available at [`https://git.dyamon.me/projects/pangler`](https://git.dyamon.me/projects/pangler).

```{#Cargo.toml .toml}
[package]
name = "pangler"
version = "0.3.0"
edition = "2021"

[dependencies]
<<dependencies>>
```