City of London Scrapers

This is a collection of shell script scrapers that I have written for the City of London website. These are meant for my own use, so comments and code quality are lacking. If you need something scraped, or want to understand why/how I'm scraping the city, please reach out by email at "contact@lillianskinner.ca". Cheers.

websites.csv

websites.csv holds an index of eScribe domains to crawl. The format is as follows:

"<eScribe domain>","<output directory>","<leave empty, this entry is used by other tools>"

As an example, an entry might look like this:

"https://pub-london.escribemeetings.com/", "LondonArchive", ""

Files will be output to ./LondonArchive/Meetings/.
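
As an illustration of the format above, here is a minimal sketch of reading websites.csv in bash. It is not taken from the scrapers themselves: `clean_field` and `list_sites` are hypothetical helper names, and the sketch assumes one entry per line with no commas inside the quoted fields.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from the repo): read websites.csv and print
# where each eScribe domain's meetings would be written.
# Assumes one entry per line and no commas inside the quoted fields.

# Strip surrounding whitespace and one pair of double quotes from a field.
clean_field() {
    local f="$1"
    f="${f#"${f%%[![:space:]]*}"}"   # trim leading whitespace
    f="${f%"${f##*[![:space:]]}"}"   # trim trailing whitespace
    f="${f#\"}"                      # drop opening quote
    f="${f%\"}"                      # drop closing quote
    printf '%s' "$f"
}

# list_sites [csv file]  (defaults to ./websites.csv)
list_sites() {
    local csv="${1:-websites.csv}" domain outdir _rest
    # "|| [ -n ... ]" keeps a final line that has no trailing newline.
    while IFS=, read -r domain outdir _rest || [ -n "$domain" ]; do
        [ -z "$domain" ] && continue   # skip blank lines
        printf '%s -> ./%s/Meetings/\n' \
            "$(clean_field "$domain")" "$(clean_field "$outdir")"
    done < "$csv"
}
```

Calling `list_sites websites.csv` with the example entry above would print the domain followed by `./LondonArchive/Meetings/`. The third field is read but ignored, since it is used by other tools.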

YOU MUST HAVE websites.csv FOR ALL ESCRIBE SCRAPERS!

Scrape eScribe meetings (SCRAPE_MEET.SH)

This bash script will scrape meetings from the eScribe meetings platform. It has a variable called SUPPORT_PAST. If SUPPORT_PAST=1 (true), meetings older than 2 months will be downloaded as well. Otherwise, they will be skipped.
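
The cutoff can be pictured with the following sketch. It is an assumption about how such a check might look, not the code in SCRAPE_MEET.SH: `should_download` is a hypothetical function, dates are assumed to be in YYYY-MM-DD form, and the date arithmetic relies on GNU date's `-d` option.

```shell
#!/usr/bin/env bash
# Sketch only: the real check in SCRAPE_MEET.SH may differ.
SUPPORT_PAST="${SUPPORT_PAST:-0}"

# should_download <meeting date YYYY-MM-DD> [<today YYYY-MM-DD>]
# Returns 0 when the meeting should be downloaded, 1 when it should be skipped.
should_download() {
    local meeting="$1" today="${2:-$(date +%F)}" cutoff
    # GNU date: the day two months before "today".
    cutoff="$(date -d "$today 2 months ago" +%F)"
    # ISO dates sort lexicographically, so a string comparison is enough.
    if [[ "$meeting" < "$cutoff" && "$SUPPORT_PAST" != 1 ]]; then
        return 1   # older than the cutoff and past meetings are disabled
    fi
    return 0
}
```

For example, `should_download 2025-12-01 2026-04-07` fails (skip) with SUPPORT_PAST=0 and succeeds with SUPPORT_PAST=1.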

The basic structure of the output files is:

./<output directory in websites.csv>/Meetings/<board/committee name>/<year>/<mm-dd>/
                                                                                   |- <agenda>.pdf
                                                                                   |- <minutes>.pdf
                                                                                   \- Attachments/
                                                                                                 |- <attachment 1>.pdf
                                                                                                 |- <attachment 2>.pdf
                                                                                                 \- etc etc
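
To make the layout above concrete, here is a small sketch that builds a meeting directory from its parts. `meeting_dir` is a hypothetical helper, the date is assumed to be in YYYY-MM-DD form, and replacing "/" in board names with "-" is an assumption made here so that the name stays a single path component.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the repo).
# meeting_dir <output dir> <board/committee name> <date YYYY-MM-DD>
meeting_dir() {
    local outdir="$1" board="$2" date="$3"
    local year="${date%%-*}"    # text before the first "-", e.g. 2026
    local mmdd="${date#*-}"     # text after the first "-", e.g. 04-07
    board="${board//\//-}"      # assumption: keep the name to one path component
    printf './%s/Meetings/%s/%s/%s' "$outdir" "$board" "$year" "$mmdd"
}
```

A caller could then create the folders with `mkdir -p "$(meeting_dir LondonArchive "Planning and Environment Committee" 2026-04-07)/Attachments"`; the committee name here is only an example.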

Scrape eScribe JSONs (SCRAPE_ESCRIBE.SH)

This bash script will scrape meeting JSON lists from the eScribe meetings platform. Each JSON will be split into batches of 50 meetings.

The basic structure of the output files is:

./<output directory in websites.csv>/Meetings (JSON)/<board/committee name>/
                                                                           |- <board/committee name>_0.json
                                                                           |- <board/committee name>_1.json
                                                                           \- etc etc
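
The batch numbering can be sketched as follows. This only illustrates the arithmetic, under the assumption that the numeric suffix is a zero-based batch index as the tree above suggests; `batch_files` and `BATCH_SIZE` are hypothetical names, not taken from SCRAPE_ESCRIBE.SH.

```shell
#!/usr/bin/env bash
# Sketch only: names the JSON files that a given number of meetings would need.
BATCH_SIZE=50

# batch_files <board/committee name> <number of meetings>
batch_files() {
    local board="$1" total="$2" batches i
    batches=$(( (total + BATCH_SIZE - 1) / BATCH_SIZE ))   # ceiling division
    for (( i = 0; i < batches; i++ )); do
        printf '%s_%d.json\n' "$board" "$i"
    done
}
```

With 120 meetings, `batch_files Council 120` lists three files, `Council_0.json` through `Council_2.json`, the last one holding the remaining 20 meetings.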

Scrape planning applications (SCRAPE_PLAN.SH)

This bash script will scrape planning applications from London's website at: https://london.ca/business-development/planning-development-applications/planning-applications

The basic structure of the output files is:

./LondonArchive/Planning Applications/<application type>/
                                                         \- <file no.> - 123 Example St/
                                                                                       |- Info.txt
                                                                                       \- Attachments/
                                                                                                     |- <attachment 1>.pdf
                                                                                                     |- <attachment 2>.pdf
                                                                                                     \- etc etc

Scrape London open data (SCRAPE_OPEN.SH)

This bash script will scrape London's ArcGIS open data platform, including maps and statistics. The server is at: https://maps.london.ca/server/rest/services/OpenData

The basic structure of the output files is:

./LondonArchive_OpenData/
                        |- <statistics 1>.xlsx.7z
                        |- <statistics 2>.csv.7z
                        \- Maps/
                               |- <map 1>.7z
                               |- <map 2>.7z
                               \- etc etc
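
For orientation, an ArcGIS REST services directory returns a machine-readable listing when `f=json` is appended to the URL. The sketch below shows one way to list the services in the OpenData folder. It is not the code in SCRAPE_OPEN.SH: `catalog_url` and `list_services` are hypothetical names, and the use of curl and jq is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: list the services published under the OpenData folder.
BASE_URL="https://maps.london.ca/server/rest/services/OpenData"

# Build the URL that asks the services directory for JSON instead of HTML.
catalog_url() {
    printf '%s?f=json' "$BASE_URL"
}

# Print one "<service name> <service type>" pair per line.
# Assumes curl and jq are installed.
list_services() {
    curl -fsS "$(catalog_url)" | jq -r '.services[] | "\(.name) \(.type)"'
}
```

Running `list_services` needs network access to maps.london.ca; what it prints depends on what the server publishes at the time.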

Scrape London Transit Commission meetings (SCRAPE_LTC.SH)

This bash script will scrape LTC meetings from their WordPress site at: https://www.londontransit.ca/agendas-and-minutes/

Attachments are downloaded as their HTML versions and then converted to PDF. The original documents (linked from the agenda PDFs) are not always OCRed, and their quality can be low. The HTML-to-PDF conversion needs the template page included at ./template/default.html.

The basic structure of the output files is:

./LondonArchive/LTC/<board/committee name>/<year>/<mm-dd>/
                                                         |- <agenda>.pdf
                                                         |- <minutes>.pdf
                                                         \- Attachments/
                                                                       |- <attachment 1>.pdf
                                                                       |- <attachment 2>.pdf
                                                                       \- etc etc

Scrape London Police Services meetings (SCRAPE_LPS.SH)

This bash script will scrape LPS meetings from their WordPress site at: https://londonpoliceserviceboard.com/board-meetings/

The basic structure of the output files is:

./LondonArchive/LPS/<board/committee name>/<year>/<mm-dd>/
                                                         |- <agenda>.pdf
                                                         |- <minutes>.pdf
                                                         \- Attachments/
                                                                       |- <attachment 1>.pdf
                                                                       |- <attachment 2>.pdf
                                                                       \- etc etc