I’m an avid reader and for the past year, I have been trying to up my creative writing game. Part of it is organising and pushing through books in the field I want to write on, so that I can understand the current trends better. This is of course just my mental excuse to keep buying Fantasy and Science Fiction books pretending it is for research when in reality it is just because I love them and want to read them.
As an author and human being possessing empathy, I don’t like Amazon very much. I still use them when it makes sense, but I try to avoid their business when I can, especially in areas related to books and writing. Because of that, I migrated from Goodreads to The StoryGraph once it became open for beta. I’m quite happy with The StoryGraph and one of my main uses is to keep my list of to-be-read books (aka TBR). Every time I’m reading reviews, or watching videos on booktube, and a book calls out to me, I end up placing it in my TBR. If part of your daily routine is checking out what is going on in sci-fi and fantasy book publishing, you’ll soon end up with a huge list of books that you kinda want to check out eventually.
There is nothing I can say against The StoryGraph, I’m enjoying it very much and my own reasons for this post are unrelated to the good quality of their service. I’m a fan, and I can see myself paying for it as soon as my trial ends (they have a free plan, I’m just happy with it and want to contribute with my wallet).
Owning your own platform
Owning your own data is a principle I learned from the IndieWeb that appeals to me on many levels. I’ve had a presence online since the early days of the Internet in Brazil, when our ISP was actually our BBS. Unfortunately, I didn’t care about my data, and multiple iterations of my blog, my posts, my photos, have all been lost as whatever services I was using back then folded. I’m done with losing my stuff. I want to be able to either publish things on my own platform and syndicate elsewhere, or at least be able to pick content from third-party platforms and merge it back into my site.
That is what is happening here: I’m using The StoryGraph to manage my TBR list, but I want that list to have a copy on my own website so that if I ever decide to leave The StoryGraph for any reason, I don’t lose my TBR. I also want a lightweight, cacheable webpage with my TBR so that I can easily access it from bookshops. Some of the shops here in London are in basements and often there is no Internet or carrier connection available. More than once, I’ve been in a shop and couldn’t remember the name of a book or author I wanted to check out. This will help me access my TBR with ease. By the time I was done with the procedure outlined in this post, I had a tbr.html at root level on my site, it looks like this:
One of the most important features I’ve added to that page—and that is not shown on the screenshot—is a huge emphasised text saying that I already own a book. More than once, I’ve purchased the same book twice because I forgot I got it months earlier in some bundle or sale.
Extracting data from The StoryGraph
Now, this is the part where one needs to make a choice. From my subjective point of view, there are basically three ways of doing things when you’re dealing with a small personal project:
- The Correct Way, which is often boring...
- The Wrong Way, but It Works™...
- The Fun Way, which often makes observers think wtf...
Some people will adopt the bleeding-edge best practices and fancy stuff when doing small personal projects, as if they’re working on the world’s next unicorn. That is quite right and serves them well, it is just not me. I enjoy programming as a hobby way more than I enjoy doing it professionally. On my personal projects, I want to treat it as a toy, something I play with because it is fun, best practices and common patterns be damned, this is my fun time and I’ll mutate any prototype that looks funny in my direction because I can.
So, instead of going the boring way with an industry-approved, battle-tested, robust scraping library, I decided to go with a language I have a lot of fun with: AppleScript.
Why do you need a scraper anyway?!
Oh, I didn’t mention, sorry. Well, as far as I can tell, The StoryGraph has no API for developers. To get my TBR out of it, I’ll need to scrape the DOM.
AppleScript is this fun language that ties into the Apple Events IPC. It is how we used to automate the hell out of our Macs when classic Mac OS was in its prime, and it is still a very good way to do stuff on current macOS. It allows you to construct your own workflows and tie separate apps together, not unlike what UNIX people do at the command line with pipes, but much more fun.
My browser of choice is Firefox, but unfortunately Firefox support for AppleScript is a joke. It basically doesn’t exist. On the other hand, Safari has a lot of AppleScript features, as can be seen in its AppleScript Dictionary entry:
Scripting Safari is quite easy, you can basically automate the whole workflow with its built-in API. And for the things that you have no API for, you can script “System Events” to simulate GUI interaction. As an example, the script below opens a webpage using Safari and gets its author from the meta tags:
How to extract the TBR
The TBR page carries almost all the information I want, except for the ISBN, which is on each book detail page when available. A loop iterates over each book, opening its book detail page and scraping it for the ISBN and Owned status.
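The shape of that loop is easy to show outside AppleScript. What follows is only a Python sketch of the idea, not the script from the Gist: the `Book` fields, the `fetch_detail` callback, and the placeholder titles and ISBNs are all made up for illustration, and a dictionary stands in for Safari loading each detail page.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Book:
    # Fields visible on the TBR list page (hypothetical names).
    title: str
    author: str
    detail_url: str
    # Fields only found on the book detail page.
    isbn: Optional[str] = None
    owned: bool = False


def enrich(books: List[Book], fetch_detail: Callable[[str], Dict]) -> List[Book]:
    """Visit each book's detail page and copy over ISBN and Owned status."""
    for book in books:
        detail = fetch_detail(book.detail_url)
        book.isbn = detail.get("isbn")  # stays None when the page has no ISBN
        book.owned = bool(detail.get("owned", False))
    return books


# Stand-in for the browser: placeholder data keyed by detail page URL.
FAKE_DETAIL_PAGES = {
    "/books/1": {"isbn": "9780000000001", "owned": True},
    "/books/2": {},  # a detail page without an ISBN
}


def fake_fetch(url: str) -> Dict:
    return FAKE_DETAIL_PAGES.get(url, {})


tbr = [
    Book("Example Book One", "Author A", "/books/1"),
    Book("Example Book Two", "Author B", "/books/2"),
]
enriched = enrich(tbr, fake_fetch)
for book in enriched:
    print(book.title, book.isbn, book.owned)
```

Keeping the page-fetching part behind a callback means the loop itself can be tried out without a browser at all.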
This script was created iteratively. I just opened the AppleScript Editor and typed away until it worked to my satisfaction: no plan, no design. After I got all the data I wanted into an AppleScript Record List, I decided to output it as JSON using a handy AppleScriptObjC utility function I found online. My original plan was to process this JSON using Pollen/Racket, which is what powers this blog, but then I thought, why don’t I generate a Pollen page directly from AppleScript? And that is what I did in the end. The output of running that script is a JSON file and a Pollen page.
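For anyone curious what "a JSON and a page" boils down to, here is a minimal Python sketch. It is an illustration under assumptions, not the AppleScript I ran: the dictionary keys, the placeholder books, the Markdown layout, and the "ALREADY OWNED" wording are invented for the example, and the `#lang pollen` first line is the header I would expect a Pollen source file to need.

```python
import json
from typing import Dict, List


def to_json(books: List[Dict]) -> str:
    # ensure_ascii=False keeps accented characters readable in the output.
    return json.dumps(books, ensure_ascii=False, indent=2)


def to_page(books: List[Dict]) -> str:
    """Render the list as a Markdown-flavoured page, one bullet per book."""
    lines = ["#lang pollen", "", "# To Be Read", ""]
    for book in books:
        entry = f"- *{book['title']}* by {book['author']}"
        if book.get("isbn"):
            entry += f" (ISBN {book['isbn']})"
        if book.get("owned"):
            # Loud marker so the same book is not bought twice.
            entry += " **ALREADY OWNED**"
        lines.append(entry)
    # LF line endings throughout, with a trailing newline.
    return "\n".join(lines) + "\n"


# Placeholder records standing in for the scraped data.
books = [
    {"title": "Example Book One", "author": "Author A",
     "isbn": "9780000000001", "owned": True},
    {"title": "Example Book Two", "author": "Author B",
     "isbn": None, "owned": False},
]

json_text = to_json(books)
page_text = to_page(books)
print(page_text)
```

Both outputs come from the same list, so the JSON stays around as a raw backup even if the page layout changes later.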
The main challenge was actually writing UTF-8 compliant text files. I think that my mistake was actually in the line endings: I was thinking in terms of classic Mac OS and used CRs instead of LFs. This caused most tools to misread the generated files. It took me a while to sort this out. I think half of the time was spent trying to figure out why my generated Pollen markdown file was not working.
I had a ton of fun building this. Watching AppleScript drive Safari feels like being in a hacker movie when the hackers invade the villain’s computer and everything starts moving on its own or whatever. I enjoy doing these little toys. Don’t get me wrong, this is not a product, this is not good AppleScript or JS code. This is just a script that I can run and update my TBR on my website.
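The CR versus LF trap is not specific to AppleScript. As a small Python sketch of the fix (the helper names and the demo file name are made up for illustration), you can normalise every line ending to LF and write the bytes out as UTF-8 without letting the platform translate newlines:

```python
import os
import tempfile


def normalize_newlines(text: str) -> str:
    """Turn CRLF (Windows) and bare CR (classic Mac OS) endings into LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def write_utf8_lf(path: str, text: str) -> None:
    # newline="\n" stops Python from translating "\n" to the platform default.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(normalize_newlines(text))


# Text with classic-Mac CR endings and a non-ASCII character.
classic_mac_text = "#lang pollen\r\rCafé TBR\r"

out_path = os.path.join(tempfile.mkdtemp(), "tbr-demo.txt")
write_utf8_lf(out_path, classic_mac_text)

# Read the raw bytes back to check what actually landed on disk.
with open(out_path, "rb") as handle:
    raw = handle.read()
print(raw)
```

Checking the raw bytes, rather than opening the file in an editor, is what makes a stray CR visible, since many editors quietly display either ending the same way.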
The most important part is that I enjoyed the process of building this. It was playful, and in the end I have something that is useful to me. I’ve created a Gist on GitHub with the complete script if you want to check it out. It is horrifying but it works.
My next step is just scheduling this script to run weekly, using either Automator or a cron job.
If you want to leave this page with some platitude or words of wisdom, then remember that it is OK to just have fun with programming. That you don’t need to build all your code as if you were coding for your job. Programming is also a playful activity, and you can have fun with it too.