diff --git a/README.MD b/README.MD
new file mode 100644
index 0000000..46b484b
--- /dev/null
+++ b/README.MD
@@ -0,0 +1,90 @@
+# City of London Scrapers
+This is a collection of shell script scrapers that I have written for the City of London website.
+
+## websites.csv
+
+`websites.csv` holds an index of eScribe domains to crawl. The format is as follows:
+```
+"","",""
+```
+As an example, an entry might look like this:
+```
+"https://pub-london.escribemeetings.com/", "LondonArchive", ""
+```
+Files will be output to `./LondonArchive/Meetings/`.
+
+YOU MUST HAVE `websites.csv` FOR ALL ESCRIBE SCRAPERS!
+
+## Scrape eScribe meetings (SCRAPE_MEET.SH)
+
+This bash script will scrape meetings from the eScribe meetings platform.
+
+The basic structure of the output files is:
+```
+.//Meetings////
+  |- .pdf
+  |- .pdf
+  \- Attachments/
+       |- .pdf
+       |- .pdf
+       \- etc etc
+```
+
+## Scrape eScribe JSONs (SCRAPE_ESCRIBE.SH)
+
+This bash script will scrape meeting JSON lists from the eScribe meetings platform. Each JSON will be split into batches of 50 meetings.
+
+The basic structure of the output files is:
+```
+./output directory in websites.csv/Meetings (JSON)//
+  |- _0.json
+  |- _1.json
+  \- etc etc
+```
+
+## Scrape planning applications (SCRAPE_PLAN.SH)
+
+This bash script will scrape planning applications from London's website at: https://london.ca/business-development/planning-development-applications/planning-applications
+
+The basic structure of the output files is:
+```
+./LondonArchive/Planning Applications//
+  \- - 123 Example St/
+       |- Info.txt
+       \- Attachments/
+            |- .pdf
+            |- .pdf
+            \- etc etc
+```
+
+## Scrape London Transit Commission meetings (SCRAPE_LTC.SH)
+
+This bash script will scrape LTC meetings from their WordPress site at: https://www.londontransit.ca/agendas-and-minutes/
+
+Attachments are downloaded as the HTML versions, converted to PDF. The original documents (linked from the agenda PDFs) may not always be OCRed, and the quality can be low.
+
+The basic structure of the output files is:
+```
+./LondonArchive/LTC////
+  |- .pdf
+  |- .pdf
+  \- Attachments/
+       |- .pdf
+       |- .pdf
+       \- etc etc
+```
+
+## Scrape London Police Services meetings (SCRAPE_LPS.SH)
+
+This bash script will scrape LPS meetings from their WordPress site at: https://londonpoliceserviceboard.com/board-meetings/
+
+The basic structure of the output files is:
+```
+./LondonArchive/LPS////
+  |- .pdf
+  |- .pdf
+  \- Attachments/
+       |- .pdf
+       |- .pdf
+       \- etc etc
+```
\ No newline at end of file