LondonScrapers/README.MD
2026-04-07 19:02:03 -04:00

# City of London Scrapers
This is a collection of shell script scrapers that I have written for the City of London website. They are meant for my own use, so comments and code quality are lacking. If you need something scraped, or want to understand why or how I'm scraping the city, please reach out by email at "contact@lillianskinner.ca". Cheers.
## websites.csv
`websites.csv` holds an index of eScribe domains to crawl. The format is as follows:
```
"<eScribe domain>","<output directory>","<leave empty, this entry is used by other tools>"
```
As an example, an entry might look like this:
```
"https://pub-london.escribemeetings.com/", "LondonArchive", ""
```
Files will be output to `./LondonArchive/Meetings/`.
YOU MUST HAVE `websites.csv` FOR ALL ESCRIBE SCRAPERS!
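As a rough illustration of how an entry in this format can be read from bash, the sketch below splits one line on commas and strips the quotes. The helper names (`strip_field`, `parse_entry`) and the `DOMAIN`/`OUTDIR` variables are hypothetical and not taken from the actual scripts. It assumes no field contains a comma, and it tolerates optional spaces after the commas.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; the real scrapers may parse websites.csv differently.

# Trim surrounding whitespace, then one pair of surrounding double quotes.
strip_field() {
    local f="$1"
    f="${f#"${f%%[![:space:]]*}"}"   # drop leading whitespace
    f="${f%"${f##*[![:space:]]}"}"   # drop trailing whitespace
    f="${f#\"}"                      # drop opening quote
    f="${f%\"}"                      # drop closing quote
    printf '%s' "$f"
}

# Split one CSV line into DOMAIN and OUTDIR (assumes no commas inside fields).
parse_entry() {
    local domain outdir unused
    IFS=',' read -r domain outdir unused <<< "$1"
    DOMAIN="$(strip_field "$domain")"
    OUTDIR="$(strip_field "$outdir")"
}

parse_entry '"https://pub-london.escribemeetings.com/", "LondonArchive", ""'
echo "$DOMAIN -> ./$OUTDIR/Meetings/"
```

To process the whole file, `parse_entry` could be called once per line inside a `while IFS= read -r line; do ...; done < websites.csv` loop.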
## Scrape eScribe meetings (SCRAPE_MEET.SH)
This bash script will scrape meetings from the eScribe meetings platform. The script has a variable called `SUPPORT_PAST`. If `SUPPORT_PAST="TRUE"`, meetings older than 2 months will be downloaded as well. Otherwise, they will be skipped.
Don't ask why "TRUE" is a string and not a boolean...
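For what it's worth, bash has no boolean type, so a flag like this can only be tested by string comparison. A minimal sketch of such a check follows. The function names and the date handling are assumptions for illustration, not the logic in `SCRAPE_MEET.SH`. Dates are assumed to be in `YYYY-MM-DD` form, which sorts correctly as plain text.

```bash
#!/usr/bin/env bash
SUPPORT_PAST="FALSE"

# Print the date two months before a YYYY-MM-DD date.
# The day of month is kept as-is (not clamped), which is fine for comparisons.
two_months_before() {
    local y=$((10#${1:0:4})) m=$((10#${1:5:2})) d=${1:8:2}
    m=$((m - 2))
    if (( m < 1 )); then
        m=$((m + 12))
        y=$((y - 1))
    fi
    printf '%04d-%02d-%s' "$y" "$m" "$d"
}

# Succeed if a meeting dated $1 should be downloaded, given today's date $2.
should_download() {
    [[ "$SUPPORT_PAST" == "TRUE" ]] && return 0
    # Inside [[ ]], "<" compares strings, which works for ISO dates.
    [[ ! "$1" < "$(two_months_before "$2")" ]]
}

should_download "2026-03-20" "2026-04-07" && echo "recent: download"
should_download "2025-12-01" "2026-04-07" || echo "old: skip"
```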
The basic structure of the output files is:
```
./<output directory in websites.csv>/Meetings/<board/committee name>/<year>/<mm-dd>/
|- <agenda>.pdf
|- <minutes>.pdf
\- Attachments/
   |- <attachment 1>.pdf
   |- <attachment 2>.pdf
   \- etc etc
```
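The layout above can be sketched as a small path builder. `meeting_dir` is a hypothetical helper, "Example Committee" is a made-up name, and the date is assumed to be available as `YYYY-MM-DD`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the per-meeting directory from its parts.
meeting_dir() {
    local outdir="$1" committee="$2" date="$3"   # date as YYYY-MM-DD
    # ${date:0:4} is the year, ${date:5:5} is the mm-dd part.
    printf './%s/Meetings/%s/%s/%s/' "$outdir" "$committee" "${date:0:4}" "${date:5:5}"
}

dir="$(meeting_dir "LondonArchive" "Example Committee" "2026-04-07")"
echo "$dir"
# mkdir -p "${dir}Attachments"   # would create the folder and its Attachments/ subfolder
```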
## Scrape eScribe JSONs (SCRAPE_ESCRIBE.SH)
This bash script will scrape meeting JSON lists from the eScribe meetings platform. Each JSON will be split into batches of 50 meetings.
The basic structure of the output files is:
```
./<output directory in websites.csv>/Meetings (JSON)/<board/committee name>/
|- <board/committee name>_0.json
|- <board/committee name>_1.json
\- etc etc
```
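The batch numbering can be illustrated with a little arithmetic. This sketch assumes that files are numbered from 0 in order and that each holds up to 50 meetings, as the listing suggests. `batch_file` and `batch_count` are hypothetical names.

```bash
#!/usr/bin/env bash
BATCH_SIZE=50

# File that holds the meeting at zero-based position $2 for committee $1.
batch_file() {
    printf '%s_%d.json\n' "$1" $(( $2 / BATCH_SIZE ))
}

# Number of files needed for $1 meetings (rounded up).
batch_count() {
    echo $(( ($1 + BATCH_SIZE - 1) / BATCH_SIZE ))
}

batch_file "Example Committee" 0     # first meeting, first file
batch_file "Example Committee" 120   # 121st meeting, third file
batch_count 120                      # 120 meetings need 3 files
```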
## Scrape planning applications (SCRAPE_PLAN.SH)
This bash script will scrape planning applications from London's website at: https://london.ca/business-development/planning-development-applications/planning-applications
The basic structure of the output files is:
```
./LondonArchive/Planning Applications/<application type>/
\- <file no.> - 123 Example St/
   |- Info.txt
   \- Attachments/
      |- <attachment 1>.pdf
      |- <attachment 2>.pdf
      \- etc etc
```
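Since the folder name combines a file number and an address, one thing a script has to watch for is a `/` in either part, which would otherwise create an extra directory level. The sketch below shows one way to handle that. The helper name, the sample file number, and the choice to replace `/` with `-` are all assumptions, not necessarily what `SCRAPE_PLAN.SH` does.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build "<file no.> - <address>" and replace any "/"
# so the result stays a single path component.
application_dir_name() {
    local name="$1 - $2"
    printf '%s\n' "${name//\//-}"
}

# "FILE-0000" is a made-up file number used only for illustration.
application_dir_name "FILE-0000" "123 Example St"
```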
## Scrape London open data (SCRAPE_OPEN.SH)
This bash script will scrape London's ArcGIS open data platform, including maps and statistics. The server is at: https://maps.london.ca/server/rest/services/OpenData
The basic structure of the output files is:
```
./LondonArchive_OpenData/
|- <statistics 1>.xlsx.7z
|- <statistics 2>.csv.7z
\- Maps/
   |- <map 1>.7z
   |- <map 2>.7z
   \- etc etc
```
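ArcGIS REST endpoints return JSON instead of the HTML services directory when `f=json` is added to the query string, which is a convenient starting point for a crawl. The sketch below only builds the URL and the archive name; the helper names are hypothetical, and the commented-out `curl` and `7z` lines are one possible way to fetch and compress, not necessarily what `SCRAPE_OPEN.SH` does.

```bash
#!/usr/bin/env bash
BASE="https://maps.london.ca/server/rest/services/OpenData"

# Append f=json to an ArcGIS REST URL (a trailing slash is removed first).
json_url() {
    printf '%s?f=json\n' "${1%/}"
}

# Name of the archive for a downloaded file: the original name plus ".7z".
archive_name() {
    printf '%s.7z\n' "$1"
}

json_url "$BASE"
archive_name "example.csv"

# To actually fetch and compress (needs network access, curl and 7z):
#   curl -fsSL "$(json_url "$BASE")" -o catalog.json
#   7z a "$(archive_name "example.csv")" "example.csv"
```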
## Scrape London Transit Commission meetings (SCRAPE_LTC.SH)
This bash script will scrape LTC meetings from their WordPress site at: https://www.londontransit.ca/agendas-and-minutes/
Attachments are downloaded in their HTML versions and converted to PDF. The original documents (linked from the agenda PDFs) are not always OCRed, and their quality can be low.
The basic structure of the output files is:
```
./LondonArchive/LTC/<board/committee name>/<year>/<mm-dd>/
|- <agenda>.pdf
|- <minutes>.pdf
\- Attachments/
   |- <attachment 1>.pdf
   |- <attachment 2>.pdf
   \- etc etc
```
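This README does not say which tool does the HTML-to-PDF conversion for LTC attachments. As a sketch under that caveat, two common command-line options are `wkhtmltopdf` and a headless Chromium. The helper below is hypothetical and only prints the command it would run, so it is safe to try without either tool installed. The Chromium binary name varies between systems (`chromium`, `chromium-browser`, `google-chrome`).

```bash
#!/usr/bin/env bash
# Hypothetical dry-run helper: print a conversion command for the given tool.
#   $1 = tool name, $2 = HTML source (URL or file), $3 = output PDF
pdf_command() {
    local tool="$1" src="$2" out="$3"
    case "$tool" in
        wkhtmltopdf) printf 'wkhtmltopdf %q %q\n' "$src" "$out" ;;
        chromium)    printf 'chromium --headless --print-to-pdf=%q %q\n' "$out" "$src" ;;
        *)           echo "unknown tool: $tool" >&2; return 1 ;;
    esac
}

pdf_command wkhtmltopdf attachment.html attachment.pdf
pdf_command chromium attachment.html attachment.pdf
```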
## Scrape London Police Services meetings (SCRAPE_LPS.SH)
This bash script will scrape LPS meetings from their WordPress site at: https://londonpoliceserviceboard.com/board-meetings/
The basic structure of the output files is:
```
./LondonArchive/LPS/<board/committee name>/<year>/<mm-dd>/
|- <agenda>.pdf
|- <minutes>.pdf
\- Attachments/
   |- <attachment 1>.pdf
   |- <attachment 2>.pdf
   \- etc etc
```