ChicagocrimeAndScrapePI
From MashupCamp
This presentation had two parts:
* Adrian Holovaty presented chicagocrime.org.
* A walkthrough of screen-scraping, followed by a discussion led by Thor Muller on the emerging phenomenon of the "ScrapePI."
Here are notes for the talk that Thor Muller gave about ScrapePI:
What is the problem?
- Basic services aren’t available as APIs: Wikipedia, TV Guide, IMDb, Craigslist!!!, and virtually all government/public domain sources
- Page scraping is annoying/hard to maintain
- Hacks (like this: http://www.hackdiary.com/archives/000070.html) are pretty iffy too
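To illustrate why page scraping is hard to maintain, here is a minimal sketch using only Python's standard library. The page snippets, the `title` class name, and the `scrape_titles` helper are all hypothetical, made up for this example; they are not taken from any of the sites named above. The scraper depends on one detail of the markup, so a redesign that renames that detail leaves it with nothing to return.

```python
from html.parser import HTMLParser


class TitleScraper(HTMLParser):
    """Collects the text of every <span class="title"> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "title") in attrs:
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data


def scrape_titles(html):
    """Return the listing titles, or fail loudly if none are found."""
    parser = TitleScraper()
    parser.feed(html)
    parser.close()
    if not parser.titles:
        # An empty result usually means the markup changed, not that
        # the page has no data, so report it instead of returning [].
        raise ValueError("no titles found; the page layout may have changed")
    return [t.strip() for t in parser.titles]


# Hypothetical markup before and after a site redesign.
OLD_PAGE = (
    '<ul><li><span class="title">Listing A</span></li>'
    '<li><span class="title">Listing B</span></li></ul>'
)
NEW_PAGE = '<ul><li><h3 class="headline">Listing A</h3></li></ul>'

print(scrape_titles(OLD_PAGE))
```

Calling `scrape_titles(NEW_PAGE)` raises `ValueError`. Failing loudly is one design choice among several; the point is that the scraper's owner, not the site's owner, finds out about the change.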
Examples of ScrapePIs
- Ontok Wikipedia API
- XMLTV (uses scrapers from many contributors)
- Comparison shopping and eBay examples
Practical issues
- Changes to source pages/breaking the scraper
- Images and media may get blocked
- Coarse lookups only (e.g. no “what has changed” lookups)
- Bandwidth consumption – both from source and for ScrapePI
- Adding an intermediary between the source and the consumer
- Authentication and cookie handling (see Yodlee)
- Limitations on content use; stay complementary with source
- Sensitive data types
- Regulated data
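One common way to reduce the bandwidth a ScrapePI draws from its source is to cache fetched pages for a short time. The sketch below is an assumption about how that might look, not something described in the talk: `CachedFetcher`, its `ttl_seconds` parameter, and the injected `fetch` function are hypothetical names chosen for the example.

```python
import time


class CachedFetcher:
    """Wraps a fetch function and reuses results for ttl_seconds."""

    def __init__(self, fetch, ttl_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch        # callable: url -> page body
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the cache can be tested
        self._cache = {}           # url -> (fetched_at, body)

    def get(self, url):
        now = self._clock()
        hit = self._cache.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]          # still fresh: no request to the source
        body = self._fetch(url)
        self._cache[url] = (now, body)
        return body
```

In use, `fetch` would be whatever function actually downloads a page. The trade-off is staleness: a longer TTL saves more bandwidth but makes the "coarse lookups only" limitation worse, since callers cannot see changes until the entry expires.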
Legal
- No explicit rights to use commercial data
- What is fair use?
- What constitutes a unique collection of data (database)?
- Data Protection Act (UK)
- Check for robots.txt
- When are there damages?
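The robots.txt check can be sketched with Python's standard `urllib.robotparser` module. The rules in `SAMPLE_ROBOTS`, the `ExampleBot` user agent, and the `allowed_by_robots` helper are hypothetical and exist only for this example. Passing a robots.txt check says what the site asks crawlers to do; it does not by itself settle any of the legal questions listed above.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(allowed_by_robots(SAMPLE_ROBOTS, "ExampleBot", "http://example.org/public/page"))
print(allowed_by_robots(SAMPLE_ROBOTS, "ExampleBot", "http://example.org/private/page"))
```

Against a live site, `RobotFileParser` can also load the file itself through `set_url()` and `read()`; the string-based version here keeps the sketch runnable offline.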
Strategic


