# City of London Scrapers
This is a collection of shell script scrapers that I have written for the City of London website.

## websites.csv

`websites.csv` holds an index of eScribe domains to crawl. The format is as follows:
```
"<eScribe domain>","<output directory>","<leave empty, this entry is used by other tools>"
```
As an example, an entry might look like this:
```
"https://pub-london.escribemeetings.com/","LondonArchive",""
```
With this entry, files are written to `./LondonArchive/Meetings/`.

YOU MUST HAVE `websites.csv` FOR ALL ESCRIBE SCRAPERS!
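
As a rough illustration of the format above, the following sketch reads entries from `websites.csv` and prints the meetings output directory for each one. It is not taken from the scrapers themselves: the function names `strip_field` and `list_targets` are hypothetical, and it assumes that fields never contain embedded commas or escaped quotes.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: read websites.csv and print "<domain> -> <output path>".
# Assumes fields contain no embedded commas or escaped quotes.

# Trim surrounding whitespace and one pair of double quotes from a CSV field.
strip_field() {
    local f="$1"
    f="${f#"${f%%[![:space:]]*}"}"   # drop leading whitespace
    f="${f%"${f##*[![:space:]]}"}"   # drop trailing whitespace
    f="${f#\"}"                      # drop the opening quote
    f="${f%\"}"                      # drop the closing quote
    printf '%s' "$f"
}

# Print one line per entry in the given CSV file.
list_targets() {
    local csv="$1" domain outdir reserved
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS=, read -r domain outdir reserved || [ -n "$domain" ]; do
        domain="$(strip_field "$domain")"
        outdir="$(strip_field "$outdir")"
        [ -z "$domain" ] && continue   # skip blank lines
        printf '%s -> ./%s/Meetings/\n' "$domain" "$outdir"
    done < "$csv"
}
```

With the example entry above, `list_targets websites.csv` would print `https://pub-london.escribemeetings.com/ -> ./LondonArchive/Meetings/`.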

## Scrape eScribe meetings (SCRAPE_MEET.SH)

This Bash script scrapes meetings from the eScribe meetings platform.

The basic structure of the output files is:
```
./<output directory in websites.csv>/Meetings/<board/committee name>/<year>/<mm-dd>/
|- <agenda>.pdf
|- <minutes>.pdf
\- Attachments/
   |- <attachment 1>.pdf
   |- <attachment 2>.pdf
   \- etc etc
```
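
The layout above can be sketched as a small path builder. This is only an illustration, not the code in `SCRAPE_MEET.SH`: the function name `meeting_dir` is hypothetical, it assumes the meeting date is available as `YYYY-MM-DD`, and replacing `/` in committee names with `-` is an assumption made here because a directory name cannot contain a slash.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the directory for one meeting.
# Assumes the meeting date is available as YYYY-MM-DD.
meeting_dir() {
    local outdir="$1" committee="$2" date="$3"
    local year="${date%%-*}"         # text before the first "-", e.g. 2023
    local mmdd="${date#*-}"          # text after the first "-", e.g. 04-18
    committee="${committee//\//-}"   # assumption: replace "/" with "-"
    printf './%s/Meetings/%s/%s/%s/' "$outdir" "$committee" "$year" "$mmdd"
}
```

For example, `mkdir -p "$(meeting_dir LondonArchive "Example Committee" 2023-04-18)Attachments"` would create `./LondonArchive/Meetings/Example Committee/2023/04-18/Attachments`.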

## Scrape eScribe JSONs (SCRAPE_ESCRIBE.SH)

This Bash script scrapes meeting JSON lists from the eScribe meetings platform. Each JSON list is split into batches of 50 meetings.

The basic structure of the output files is:
```
./<output directory in websites.csv>/Meetings (JSON)/<board/committee name>/
|- <board/committee name>_0.json
|- <board/committee name>_1.json
\- etc etc
```
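
To make the batch naming concrete, this sketch computes which file a given meeting would land in and how many files a committee would need. It assumes that meetings are numbered from 0 and that batches are filled in order; the README only states the batch size, so both are assumptions, and `batch_file` and `batch_count` are hypothetical names.

```shell
#!/usr/bin/env bash
BATCH_SIZE=50   # batch size stated in the README

# Print the JSON file name holding meeting number $2 (0-based) of committee $1.
batch_file() {
    local committee="$1" index="$2"
    printf '%s_%d.json' "$committee" "$(( index / BATCH_SIZE ))"
}

# Print how many batch files are needed for $1 meetings (ceiling division).
batch_count() {
    local total="$1"
    printf '%d' "$(( (total + BATCH_SIZE - 1) / BATCH_SIZE ))"
}
```

Under these assumptions, meetings 0 to 49 go into `<board/committee name>_0.json`, meeting 50 starts `<board/committee name>_1.json`, and a committee with 120 meetings needs 3 files.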

## Scrape planning applications (SCRAPE_PLAN.SH)

This Bash script scrapes planning applications from London's website at: https://london.ca/business-development/planning-development-applications/planning-applications

The basic structure of the output files is:
```
./LondonArchive/Planning Applications/<application type>/
\- <file no.> - 123 Example St/
   |- Info.txt
   \- Attachments/
      |- <attachment 1>.pdf
      |- <attachment 2>.pdf
      \- etc etc
```
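
Given the layout above, one quick sanity check after a run is to look for application folders that have no `Info.txt`. The sketch below is not part of `SCRAPE_PLAN.SH`; `missing_info` is a hypothetical name, and it assumes the two-level `<application type>/<file no.> - <address>/` layout shown above.

```shell
#!/usr/bin/env bash
# Hypothetical check: print every application folder that lacks Info.txt.
# $1 is the root, e.g. "./LondonArchive/Planning Applications".
missing_info() {
    local root="$1" dir
    # Each match is <root>/<application type>/<application folder>/
    for dir in "$root"/*/*/; do
        [ -d "$dir" ] || continue   # glob matched nothing
        [ -f "${dir}Info.txt" ] || printf '%s\n' "$dir"
    done
}
```

For example, `missing_info "./LondonArchive/Planning Applications"` prints one path per incomplete folder and prints nothing when every folder has its `Info.txt`.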

## Scrape London Transit Commission meetings (SCRAPE_LTC.SH)

This Bash script scrapes LTC meetings from their WordPress site at: https://www.londontransit.ca/agendas-and-minutes/

Attachments are downloaded in their HTML versions and then converted to PDF. The original documents (linked from the agenda PDFs) are not always OCRed, and their quality can be low.

The basic structure of the output files is:
```
./LondonArchive/LTC/<board/committee name>/<year>/<mm-dd>/
|- <agenda>.pdf
|- <minutes>.pdf
\- Attachments/
   |- <attachment 1>.pdf
   |- <attachment 2>.pdf
   \- etc etc
```

## Scrape London Police Services meetings (SCRAPE_LPS.SH)

This Bash script scrapes LPS meetings from their WordPress site at: https://londonpoliceserviceboard.com/board-meetings/

The basic structure of the output files is:
```
./LondonArchive/LPS/<board/committee name>/<year>/<mm-dd>/
|- <agenda>.pdf
|- <minutes>.pdf
\- Attachments/
   |- <attachment 1>.pdf
   |- <attachment 2>.pdf
   \- etc etc
```