# City of London Scrapers

This is a collection of shell script scrapers that I have written for the City of London website and related sites (eScribe, the London Transit Commission, and London Police Services).

## websites.csv

`websites.csv` holds an index of eScribe domains to crawl. The format is as follows:

```
"<eScribe domain>","<output directory>","<leave empty, this entry is used by other tools>"
```

As an example, an entry might look like this:

```
"https://pub-london.escribemeetings.com/", "LondonArchive", ""
```

Files will be output to `./LondonArchive/Meetings/`.

YOU MUST HAVE `websites.csv` FOR ALL ESCRIBE SCRAPERS!
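
As a rough sketch of how an entry in this format could be read (this is not taken from the scrapers themselves), the snippet below pulls the quoted fields out of each line. `csv_field` is a hypothetical helper, and the approach assumes that no field contains a double quote.

```shell
#!/bin/sh
# Hypothetical helper, not part of the scrapers: print field N (1-based) of a
# websites.csv line. Splitting on double quotes puts the quoted values in the
# even-numbered awk fields, so spaces after the commas do not matter.
# Assumes no field contains a double quote.
csv_field() {
    printf '%s\n' "$1" | awk -F'"' -v n="$2" '{ print $(2 * n) }'
}

# Walk websites.csv (if present) and show where each site's meetings would go.
if [ -f websites.csv ]; then
    while IFS= read -r line || [ -n "$line" ]; do
        [ -n "$line" ] || continue
        domain=$(csv_field "$line" 1)
        out_dir=$(csv_field "$line" 2)
        printf '%s -> ./%s/Meetings/\n' "$domain" "$out_dir"
    done < websites.csv
fi
```

With the example entry above in `websites.csv`, this would print `https://pub-london.escribemeetings.com/ -> ./LondonArchive/Meetings/`.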

## Scrape eScribe meetings (SCRAPE_MEET.SH)

This bash script will scrape meetings from the eScribe meetings platform.

The basic structure of the output files is:

```
./<output directory in websites.csv>/Meetings/<board/committee name>/<year>/<mm-dd>/
 |- <agenda>.pdf
 |- <minutes>.pdf
 \- Attachments/
     |- <attachment 1>.pdf
     |- <attachment 2>.pdf
     \- etc etc
```
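
To make the layout concrete, here is a minimal sketch of how such a path could be assembled from its parts. `meeting_dir` is a hypothetical helper, the `YYYY-MM-DD` input format is an assumption, and "Example Committee" is a placeholder name; the real script may build its paths differently.

```shell
#!/bin/sh
# Hypothetical helper: build the directory for one meeting.
#   $1 = output directory from websites.csv
#   $2 = board/committee name
#   $3 = meeting date as YYYY-MM-DD (an assumed input format)
meeting_dir() {
    year=${3%%-*}       # text before the first "-": the year
    month_day=${3#*-}   # text after the first "-": mm-dd
    printf './%s/Meetings/%s/%s/%s/\n' "$1" "$2" "$year" "$month_day"
}

meeting_dir "LondonArchive" "Example Committee" "2024-03-05"
```

The call at the end prints `./LondonArchive/Meetings/Example Committee/2024/03-05/`.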

## Scrape eScribe JSONs (SCRAPE_ESCRIBE.SH)

This bash script will scrape meeting JSON lists from the eScribe meetings platform. Each list is split into batches of 50 meetings, one JSON file per batch.

The basic structure of the output files is:

```
./<output directory in websites.csv>/Meetings (JSON)/<board/committee name>/
 |- <board/committee name>_0.json
 |- <board/committee name>_1.json
 \- etc etc
```
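
The batch numbering can be illustrated with a little arithmetic. This sketch assumes that meetings fill the batches in order and that batch numbers start at 0, as the file names above suggest; `batch_file` and `batch_count` are hypothetical helpers, not functions from the script.

```shell
#!/bin/sh
BATCH_SIZE=50   # batch size stated above

# Hypothetical helper: name of the batch file holding the meeting at
# zero-based position $2 in the full list for board/committee $1.
batch_file() {
    printf '%s_%d.json\n' "$1" "$(( $2 / BATCH_SIZE ))"
}

# Hypothetical helper: number of batch files needed for $1 meetings
# (integer division, rounded up).
batch_count() {
    echo "$(( ($1 + BATCH_SIZE - 1) / BATCH_SIZE ))"
}

batch_file "Example Committee" 49    # last meeting of the first batch
batch_file "Example Committee" 50    # first meeting of the second batch
batch_count 120
```

Under those assumptions, a committee with 120 meetings would end up with three files, `_0.json` through `_2.json`.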

## Scrape planning applications (SCRAPE_PLAN.SH)

This bash script will scrape planning applications from London's website at: https://london.ca/business-development/planning-development-applications/planning-applications

The basic structure of the output files is:

```
./LondonArchive/Planning Applications/<application type>/
 \- <file no.> - 123 Example St/
     |- Info.txt
     \- Attachments/
         |- <attachment 1>.pdf
         |- <attachment 2>.pdf
         \- etc etc
```
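
One thing to watch with a layout like this is that any `/` inside a type, file number, or address would create an extra directory level. The sketch below shows one way to guard against that. Both helpers and all sample values are hypothetical, and I am not claiming that the script does this or that real file numbers contain slashes.

```shell
#!/bin/sh
# Hypothetical helper: replace "/" so a value stays a single path component.
safe_name() {
    printf '%s' "$1" | tr '/' '-'
}

# Hypothetical helper: directory for one planning application.
#   $1 = application type, $2 = file number, $3 = street address
application_dir() {
    printf './LondonArchive/Planning Applications/%s/%s - %s/\n' \
        "$(safe_name "$1")" "$(safe_name "$2")" "$(safe_name "$3")"
}

application_dir "Example Type" "X-0000/Y-0000" "123 Example St"
```

The call at the end prints `./LondonArchive/Planning Applications/Example Type/X-0000-Y-0000 - 123 Example St/`.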

## Scrape London Transit Commission meetings (SCRAPE_LTC.SH)

This bash script will scrape LTC meetings from their WordPress site at: https://www.londontransit.ca/agendas-and-minutes/

Attachments are downloaded as their HTML versions and then converted to PDF. The original documents (linked from the agenda PDFs) are not always OCRed, and their quality can be low.

The basic structure of the output files is:

```
./LondonArchive/LTC/<board/committee name>/<year>/<mm-dd>/
 |- <agenda>.pdf
 |- <minutes>.pdf
 \- Attachments/
     |- <attachment 1>.pdf
     |- <attachment 2>.pdf
     \- etc etc
```

## Scrape London Police Services meetings (SCRAPE_LPS.SH)

This bash script will scrape LPS meetings from their WordPress site at: https://londonpoliceserviceboard.com/board-meetings/

The basic structure of the output files is:

```
./LondonArchive/LPS/<board/committee name>/<year>/<mm-dd>/
 |- <agenda>.pdf
 |- <minutes>.pdf
 \- Attachments/
     |- <attachment 1>.pdf
     |- <attachment 2>.pdf
     \- etc etc
```