Web scraping
by RS  admin@robinsnyder.com


1. Web scraping
Web scraping is a technique for automatically or semi-automatically downloading content from the Internet.

2. Ethical issue
Before proceeding, some ethical considerations are covered.

3. Levels of scraping
There are several levels of scraping, each of which may have different rules for how content may be acquired and used. At one end of the spectrum, scraping for personal use (while modeling personal behavior in terms of click rates and download volume) is fairly permissive.

At the other end of the spectrum, scraping for commercial use should be thoroughly investigated, and approval should be received from the host site (via its terms of use, etc.) before any scraping is done.

4. General rules
As a general personal rule, I sometimes scrape sites that want their content to be consumed and that often provide ways to do so.

So, if I could sit at a computer for an hour, click on a few hundred links, do a "save as" command, and process or consume the content, then I consider it reasonable to automate that same process at a comparable rate.

5. Headlines
The example used here is that of headline feeds, whereby news headlines collected over time can be used for (personal) research purposes to identify and study trends, sentiment, etc. The full articles can be used, but, for many research purposes, just a headline and an accompanying paragraph are sufficient.

News feeds can be obtained via a type of XML (Extensible Markup Language) feed called RSS (Really Simple Syndication).
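
As a rough sketch (the feed URL below is only a placeholder), Python's standard library can fetch such a feed and pull out the headline and summary of each item:

    import urllib.request
    import xml.etree.ElementTree as ET

    url = "https://example.com/rss.xml"  # placeholder feed URL

    # Fetch the raw feed (RSS 2.0 is XML with channel/item elements).
    with urllib.request.urlopen(url) as response:
        xml_text = response.read()

    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        summary = item.findtext("description", default="")
        print(title)
        print(summary)
        print()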

6. Data analysis
There are many ways to analyze the collected data, but first one must have a way to collect the data and then actually collect the data.

7. Academic conferences
Another feed type that I have used is academic conference announcements - to consolidate and display upcoming conferences of interest.

8. One approach
One (manual) approach is to use an RSS reader. I have used the open-source Thunderbird email client to subscribe to RSS feeds. These feeds are stored in a standard email format.

I have used Lua to read that email format (conference announcements).

I have used Python to (more easily) read that email format (for student submissions via email, headlines, etc.).
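
As a minimal sketch (the file path is hypothetical, and it assumes the feed folder is stored in Thunderbird's usual mbox format), Python's standard mailbox module can walk such a folder and list the stored headlines:

    import mailbox

    # Hypothetical path to a Thunderbird feed folder stored in mbox format.
    path = "/path/to/Thunderbird/Profile/Mail/Feeds/Headlines"

    for message in mailbox.mbox(path):
        # Each saved feed entry is one message; the headline is the subject line.
        subject = message["subject"] or ""
        date = message["date"] or ""
        print(date, subject)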

9. Another approach
Another approach is to use Python to directly read the RSS feed on a periodic basis and save the results for later processing.

An RSS reader will fetch the RSS feed many times a day (depending on its settings).

I have found that, for research purposes, a once-a-day download and save is sufficient.
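
A minimal sketch of that once-a-day download and save (the feed URL and output directory are placeholders) might look like this:

    import datetime
    import os
    import urllib.request

    url = "https://example.com/rss.xml"  # placeholder feed URL
    out_dir = "rss_data"                 # placeholder output directory

    os.makedirs(out_dir, exist_ok=True)

    # Save the raw feed under a date-stamped name, e.g. headlines-2024-01-31.xml.
    today = datetime.date.today().isoformat()
    filename = os.path.join(out_dir, f"headlines-{today}.xml")

    with urllib.request.urlopen(url) as response:
        data = response.read()

    with open(filename, "wb") as f:
        f.write(data)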

10. Refined approach
The refined approach, then, is a once-a-day download and save of the raw feed data, as sketched above. This works well because most RSS feeds contain entries for the last several days.

Note that sites occasionally change their RSS URL, so a good program should detect this so that the URL can be adjusted.
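
One simple check (again, only a sketch with a placeholder URL) is to verify that the download succeeds and that the result still parses as an RSS feed with at least one item; if it does not, the URL may have moved and can be investigated:

    import urllib.request
    import xml.etree.ElementTree as ET

    url = "https://example.com/rss.xml"  # placeholder feed URL

    def looks_like_rss(data):
        # True if the bytes parse as XML and contain at least one <item> element.
        try:
            root = ET.fromstring(data)
        except ET.ParseError:
            return False
        return root.find(".//item") is not None

    try:
        with urllib.request.urlopen(url) as response:
            data = response.read()
    except OSError as error:
        print("Download failed; the RSS URL may have moved:", error)
    else:
        if not looks_like_rss(data):
            print("Response did not parse as RSS; the RSS URL may have changed.")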

11. RSS Reader
When accessing RSS feeds programmatically, it can be useful to subscribe to the same feed using an actual RSS reader so that you can see what you should be getting.

A drawback of not using an actual RSS reader is that one must then ensure that the program is run once a day. Forget a day, and there is a missing day of data.

12. Once a day
How does one ensure that the program is run once a day? I typically set the job to run in the middle of the morning (Eastern Time Zone) when there is a lot of unused bandwidth in the United States.

13. Always running computer
How does one ensure that the computer on which the job runs is available every day?

I have used a (small, older, and very low-power) Raspberry Pi that runs all the time (on a UPS), runs the designated cron jobs, and saves the data locally.
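
For example, a crontab entry along these lines (the time, script path, and log path are only illustrative) runs the download script once a day and captures its output:

    # Illustrative crontab entry (edit with: crontab -e).
    # Run the download script every day at 10:00 local time.
    0 10 * * * /usr/bin/python3 /home/pi/rss_download.py >> /home/pi/rss_download.log 2>&1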

14. Approaches
There are (at least) two general ways to automate the scraping of web sites.

15. End of page

by RS  admin@robinsnyder.com