2017-12-18
For Christmas 2015, I received a Raspberry Pi from an older cousin. Everybody knows that the most difficult thing about a Raspberry Pi is to find a really neat use case (home media center doesn’t count) that does not involve purchasing more hardware or DIY.
Long story short, I started turning it into a server around February 2017, got hacked, and learned a ton about networking, ports and Linux. By September 2017, the Pi was stable as a git/ssh server, it worked, and all was good.
But…
During the research phase, I came across this website. Go on, take a look.
What you may see is an old, random website.
What I saw was a new project: a personal website.
Well, this was not the first time the idea of a personal website had popped up - it had been latent in my mind since the 8th grade, and I even toyed with Blogspot (or was it already Google Sites? circa 2011), but I had nothing to say. Also, it felt artificial, like the website wasn’t mine, and I always ended up deleting the drafts.
But when I found that website, I realized I could really have my own place, a tiny corner on the Internet, like Dan Luu, Brian Krebs or Julia Evans.
So, one of the problems - how to do it properly - was now solvable. But the second part - what to write - remained unanswered. And life continued: my 2nd year at University started, and the Pi stayed stable.
On Hacker News, there is sometimes a periodic convergence of posts. For example, someone posts about a new version of Go/Rust, it reaches the front page, discussion ensues, somebody else posts a related link, and this can go on for a week (I’m a fairly new lurker).
In the last week of November, the theme was note taking and related tools. Personally, I had already tried my luck with agendas and notebooks during high school, but it never stuck, partially because I could just memorize the homework, and everything else fit in each class’ notebook; when I reached University, I found a really good To Do app to remember time-sensitive stuff - tests, projects, etc.
But I faced a problem when I started trying to do projects on my own: I quickly realized that I needed to take notes of what I was doing, or else I would forget all the little details - for example, what I would need to do to configure the Pi again to that golden, working state.
At this point, the website idea came back to my mind - blog posts explaining the technical details, pitfalls, stuff tried, useful links, Unix history, hacker culture.
And so, this is my first page in a blog that does not exist - yet.
So, I wanted to make/host a website, with a few goals in mind.
The stack that came to my mind was Pandoc (Markdown -> HTML) + (GNU) make; Markdown just feels natural to write, and I picked up make during the summer (and when all you have is a hammer, everything looks like a nail).
So, I started by looking at the Pandoc part, and found this blog. This is basically what I had in mind (different style, though). Looking at the source code, we can see that Professor McDaniel uses a bash script to manage the conversion of all the Markdown files to HTML; using make, we can avoid re-converting the files that weren’t modified since the last run, therefore reducing processing waste[1].
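The core of the idea, as a minimal sketch (my actual Makefile has more to it):

posts/%.html: posts/%.md
	pandoc --standalone --output $@ $<

Because each .html target depends on its Markdown source, make compares timestamps and re-runs pandoc only for the files that changed since the last build.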
Regarding the style, I generally like minimalism, both aesthetically and in terms of resource consumption; in a way, it is in the spirit of YAGNI. So, I started looking for stylesheets, and came across this stylesheet. It is minimalistic, and minified it is 3kb. In a word, it fits the bill, after some light customizations.
I also made the conscious decision of not having JavaScript in any form. I do not like being tracked when I browse the Internet; therefore, it seemed only right to keep my website clean.
(Update: de-bloated the CSS, reduced from 141 lines down to 5 selectors.)
While making this website, I had to solve some problems / make a few decisions: where to host it, how to handle links, and make’s difficult-to-understand behaviour.

I mentioned above that, originally, I thought about hosting the website on my Raspberry Pi. During that year (2017) I realized that it was not a good idea to have a public-facing webserver on the home network. Luckily, I discovered that my University provides a “web” service.
This “service” consists in having a folder called web in the personal user account on the central SSH server, and whatever exists inside that folder gets served under a subdirectory of a subdomain. This was exactly what I was looking for: a simple way to get some content online, leaving loads of freedom regarding what and how to present it.
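For illustration, the mapping works roughly like this (hostname and username made up):

~/web/index.html          ->  https://web.university.example/~user/index.html
~/web/posts/001-Etc.html  ->  https://web.university.example/~user/posts/001-Etc.html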
The downsides are that I do not get to learn about the practical side of TCP/IP, how to set up a webserver, the differences between HTTP and HTTPS, or whether it is possible to serve the same content on the Web and on Gopher at the same time. Also, on a more practical note, the environment has some constraints: a disk quota of 10 GB[2], no working cron system, CPU and RAM limited by the utilization of other users, and no possibility of using SSH keys instead of passwords[3].
From my previous experience[4], I already knew roughly which problems I would have with linking. In this case, because of the hosting conditions, the main issue is with absolute links: because I am not at the root of the domain, to link to my index.html, for example, instead of linking to /index.html, I need to link to _BASE_DIR_/index.html, where _BASE_DIR_ stands for the subdirectory I am served under.
The problem is that I do not want to hardcode that base directory everywhere. What if I want to change to another hosting solution? I would have to change it everywhere[5]! To solve this problem, I took inspiration from Jekyll Variables. Basically, I wanted to have variables, use them, and control the value of each variable during the make process. To do this, I settled for the simplest way possible: have “special” strings, and substitute them after generating the HTML. The implementation is visible in the current Makefile: notice the make variable PREPROCESSOR[6], which is actually a sed command to replace text. So, if _BASE_DIR_ appears anywhere on the HTML page, it automatically gets replaced by the correct URL.
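A sketch of how the pieces fit together (the URL is made up and the rule is illustrative; only the variable name PREPROCESSOR comes from the real Makefile):

BASE_DIR = https://web.university.example/~user
PREPROCESSOR = sed -i 's|_BASE_DIR_|$(BASE_DIR)|g'

%.html: %.md
	pandoc --standalone --output $@ $<
	$(PREPROCESSOR) $@

make expands $(BASE_DIR) before the shell ever runs sed, so moving to another host means changing a single line.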
After having a working Makefile, I needed to add some kind of navigation besides the navbar and footer. I first thought about doing something similar to Professor McDaniel: a principal, introductory page per section, linking to the posts. That would be inflexible, and time-consuming to do in make; so, I went with another idea: a sitemap, with all the links in one place. If I ever feel the need to create a presentation page, I can do it for each particular occasion.
Suddenly, I remembered that the tree command can output HTML. I tried it, and it did output HTML, and well-formed HTML at that. Now, I only had to figure out how to turn that output into a proper page, with the CSS, navbar and footer.
However, this proved really difficult to do directly on the tree output: I would need to insert HTML in the right places. After this, I tried to capture the body, convert it to Markdown, and reconvert it back to HTML, to process it like a normal post. For some reason, the HTML -> Markdown conversion never goes well, and I had a backslash-escaping problem. Also, because tree relies on monospace fonts and whitespace to align properly, I initially thought about using code blocks; but pandoc does not parse links inside code blocks. That also rules out writing my own version of tree that outputs Markdown with links: the links would not be parsed and converted inside the code blocks.
So, I had no easy choice besides capturing the body and giving it to pandoc to make a standalone page with CSS, footer and navbar. Many complications appeared; long story short, I ended up learning some sed, and made this “one-liner”, technically a 3-liner (broken up for readability):
echo '<div style="font-family: monospace, monospace">';
tree -H . --charset=ascii --noreport |
sed -n "\|<body>|,\|</body>|p" | sed -n "\|<p>|,\|</p>|p" |
sed -E "s|<h1(.)*h1>||g" |
sed -e 's/-- /--\&nbsp\;/g' -e 's/|   /|\&nbsp\;\&nbsp\;\&nbsp\;/g' |
head --lines=-3;
echo "</div>"
Lines 1 and 7 ensure that the block will have a monospaced font[7].

Line 2 outputs an entire HTML page, and all we want is the first <p> tag inside the body - so, in line 3, we restrict the output to the lines that are inside <body>, and then to the lines between <p> tags[8]. The mechanism used is the following:

- the -n option disables output, except where specifically told;
- the p command (inside the sed script) prints lines according to the selection; the selection has this syntax: "\|first pattern to match|,\|last pattern|".
Then, on the 4th line, the heading is removed. It was still in the pipes because it is on the same line as the <p> tag we want to catch. To remove it, I used a substitution command:

- the -E option: use extended regexes;
- the s command: substitute the first expression - <h1(.)*h1> - with the second expression (in this case, empty) and do it globally (to all the matches).

On the 5th line, all the spaces are substituted with non-breaking spaces (&nbsp;), to avoid the collapse into a single space when rendering. For this part, there are 2 situations that are critical for alignment and are not considered by tree’s HTML output; both are handled in the same sed instance.
Finally, on line 6, there were extra elements at the end; fortunately, these extra elements always occupy the last 3 lines, so using head was enough.
And that’s it! After all of this, all that’s left is to use pandoc to take this HTML fragment, add the header and navbar, reference the CSS, and it is good to go. You can see this in the current Makefile for this website.
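In outline, that final step can look something like this (the script and file names are placeholders, not the exact ones I use):

sh sitemap.sh > sitemap-body.html
pandoc -f html -t html --standalone --css=_BASE_DIR_/style.css \
    -B navbar.html -A footer.html sitemap-body.html -o sitemap.html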
A common feature in blogs is a list of the most recent posts on the first page. It turned out to be one of the most challenging pieces of this website to make.
So, the only time reference I can rely on is the post date, located in what I would call the front matter.
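The format is something like the following (a sketch - the exact delimiter lines are an assumption; what matters, and what the commands below rely on, is that the title is on line 2 and the date on line 3):

---
title: An exemplary title
date: 2017-02-30
---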
In HTML, this ends up in the “header” div, in an H3 tag. I decided that the script giving me the list of posts would work on the Markdown files, because it is easier to get information about a post from them. Note that this is unlike the sitemap script, which works on the files already converted to HTML. Also, unlike the previous approach, we will output Markdown links: we concatenate index.md with this list of links, and then use pandoc to get the HTML.
So, in Shell[9], how do you get the 3rd line of an arbitrary file? With a pipe, obviously (post.md here is a stand-in for any post file):
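head -3 post.md | tail -1
# Result:
# date: 2017-02-30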
head gets the first 3 lines, and tail gets the last out of those 3. And how do you get the 2nd string, if we divide using spaces? With a pipe:
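head -3 post.md | tail -1 | cut -d ' ' -f 2-
# Result:
# 2017-02-30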
cut separates the lines based on the provided delimiter (-d ' '), and selects the fields from 2 until the end of the line (-f 2-). In this case, the first field is “date:”.
Ok, now we can get a date out of an arbitrary post. But that only serves to sort posts; we also need the title and the path of the file to make a link. So, to also get the post title, we consider the last 2 lines; the cut logic remains the same, because we want all space-separated fields except the first one, which is “title:” in the first line.
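With the same stand-in file:

head -3 post.md | tail -2 | cut -d ' ' -f 2-
# Result:
# An exemplary title
# 2017-02-30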
Now, we need to get the filenames. To do this, I actually just use find and a for loop: find gets me the files, and the for loop executes the commands on each file (simplified implementation, due to pesky things like paths):
for i in $(find posts/ -type f); do
echo "$(head -3 "$i" | tail -2 | cut -d ' ' -f 2-)"
echo "$i"
done;
# Result:
# An exemplary title
# 2017-02-30
# posts/debug/001-Etc.md
But now we are outputting 3 lines for each file and, from experience, I know the logic is easier with one line per file. So, changing from echo to printf:
for i in $(find posts/ -type f); do
printf "%s\t%s" \
"$(head -3 "$i" | tail -2 | cut -d ' ' -f 2- )" \
"$i"
done;
# Result:
# An exemplary title
# 2017-02-30 posts/debug/001-Etc.md
We still have a problem, because line 3 outputs 2 lines: one for the title, another for the date. So, we need to “translate” characters - we use tr to turn newlines into tabs:
for i in $(find posts/ -type f); do
printf "%s\t%s" \
"$(head -3 "$i" | tail -2 | cut -d ' ' -f 2- | tr '\n' '\t' )" \
"$i"
done;
# Result:
# An exemplary title 2017-02-30 posts/debug/001-Etc.md
It would be nice to have the date at the beginning of the line; then we would just need to sort the lines. As a side effect, the title and filename would be in a good position to make Markdown-style links. So, applying tac to our ever-growing mega-pipe, we invert the order of the title and date lines before substituting newlines with tabs:
for i in $(find posts/ -type f); do
printf "%s\t%s" \
"$(head -3 "$i" | tail -2 | cut -d ' ' -f 2- | tac | tr '\n' '\t' )" \
"$i"
done;
# Result:
# 2017-02-30 An exemplary title posts/debug/001-Etc.md
Ok, now we have a way to get all the information we need from the posts. Next, we need to sort it all, where “all” is the output of a for loop; so, we group the for loop (using {})[10], and pass the output to sort in descending order (reverse, aka -r). Also, I do not want all posts to appear; so, let’s limit the list to the first 10 (for example):
{ for i in $(find posts/ -type f); do
printf "%s\t%s" \
"$(head -3 "$i" | tail -2 | cut -d ' ' -f 2- | tac | tr '\n' '\t' )" \
"$i"
done; } | sort -r | head -10
# Result:
# 2017-04-31 Another possible title posts/projects/002-Other.md
# 2017-02-30 An exemplary title posts/debug/001-Etc.md
Ok, now what I want is this type of link[11]:
- 2017-04-31 [Another possible title](BASE_DIR/posts/projects/002-Other.html)
There are several transformations at play:

- insert “- ” at the beginning of the line;
- replace the first tab with “ [”;
- replace the next tab with “](BASE_DIR/”;
- replace “.md” with “.html)”.

The sed command to do all this at once is:
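sed -E -e 's|^|- |' -e 's|\t+| [|' -e 's|\t+|](BASE_DIR/|' -e 's|.md|.html\)|'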
You can see that the commands that deal with tabs have the regex quantifier +, because sometimes an extra tab sneaks in.
Putting it all together:
{ for i in $(find posts/ -type f); do
printf "%s\t%s" \
"$(head -3 "$i" | tail -2 | cut -d ' ' -f 2- | tac | tr '\n' '\t' )" \
"$i"
done; } | sort -r | head -10 | sed -E -e 's|^|- |' -e 's|\t+| [|' \
-e 's|\t+|](BASE_DIR/|' -e 's|.md|.html\)|'
# Result:
# - 2017-04-31 [Another possible title](BASE_DIR/posts/projects/002-Other.html)
# - 2017-02-30 [An exemplary title](BASE_DIR/posts/debug/001-Etc.html)
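To close the loop, as described above, this list gets concatenated with index.md and handed to pandoc (recent-posts.sh is a placeholder name for the script above):

{ cat index.md; sh recent-posts.sh; } | pandoc --standalone --output index.html
# ...followed by the _BASE_DIR_ substitution done by PREPROCESSOR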
[1] The problem with recompiling the whole website every time is that the build time only grows, and never shrinks. For example, I was tasked with building a website for my University’s CTF group, STT; it uses Jekyll. On its own, Jekyll is a really fine piece of software, with goodies like collections and the Liquid templating engine. When it had only 3 pages, it took about 1s to convert and serve on my Chromebook (4 GB of RAM, quad-core Intel); but now, with a tag and author system and 10 posts, it already takes more than 6 seconds. In a year, I probably won’t be able to compile it in less than 1 minute. (I am partially guilty - I know where the for loops iterating over all the pages are :P )
[2] It was initially 300 MB, but has increased over time. Also, note that this 10 GB quota is on the whole home partition, not just on the size of the web folder.
[3] The students’ storage is kept on an AFS file system, mounted on a set of 5 machines, and we SSH into one of those 5 machines. Because the home directory of the user trying to log in may not be mounted, and to avoid the overhead of accessing each user’s .ssh folder on every login attempt, using SSH keys is impossible / disabled.
[4] I “created” the structure for this Jekyll website.
[5] I could just use sed to change all the references. Not difficult, but not elegant either…
[6] I know, technically a post-processor… Will fix one day.
[7] Specifying monospace twice is not a typo; see this explanation.
[8] What I really wanted was a non-greedy match; this solution includes extra <p> sections.
[9] If you have any doubts about any of the commands or shell constructs used, I suggest you read the man page of the utility in question.
[10] Actually, we do not need to group the for loop, because a for loop already acts as a group (I only tested this after the fact).
[11] Remember that absolute links have a catch.