Python: Web scraping
by RS  admin@robinsnyder.com : 1024 x 640


1. Python: Web scraping
A URL (Uniform Resource Locator) for an RSS feed will be used to show web scraping using Python. First, some preliminaries. It is often necessary to run tasks and save data on a periodic basis. Rather than use a full database system, which adds considerable complexity, a simpler NoSQL-style approach using the file system can be used.

2. CNN
[Image: CNN RSS feeds]
A CNN RSS feed will be used as the example RSS feed. Image as of 2020-03-23.

3. Terms of use
[Image: CNN RSS feed terms of use]
This appears to say that you may not use any of the CNN RSS feeds to compete with CNN. Image as of 2020-03-23.

4. Links to content pages
[Image: CNN RSS feed link to content pages]
This appears to say that you may not use any of the CNN RSS feeds to compete with CNN. Image as of 2020-03-23.

5. Scraper
The term scraper is used here in the general sense: a program that retrieves content from the web and extracts data from it.

In light of the above restrictions, Python is used here to create a personal custom RSS reader and processor for personal educational and research use.

6. Disclaimer
At no time should one try to use techniques shown here for commercial use in any way without getting proper consent and legal advice!

Note: The Python program(s) shown here are toy programs to illustrate concepts. Some work would be required to make them generally useful for sophisticated personal educational and research use.

7. First step
The first step would be to grab the current RSS feed and save it for later processing. In general, a data flow pipeline is used for the entire system (to be discussed later).

This requires just one web access: the RSS feed is fetched once and saved as text so that it can be opened (e.g., using Notepad++) and inspected to determine how to process it further.

8. Feed list
A simple RSS feed list can be represented as follows, where only one item is used.
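
As a minimal sketch of such a list, each item might pair a short feed name (for use in file names) with the feed URL. The variable name `rss_feeds`, the short name `cnn-top`, and the specific CNN URL are assumptions made for illustration, not taken from the original program.

```python
# Hypothetical feed list with a single item.
# Each item is (short name, feed URL); both values are illustrative assumptions.
rss_feeds = [
    ("cnn-top", "http://rss.cnn.com/rss/cnn_topstories.rss"),
]

# Walk the list; a real program would fetch each URL here.
for name, url in rss_feeds:
    print(name, url)
```

More feeds can be added as further tuples without changing the loop.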


9. Notes
Note: In these program examples, the RSS feed list is repeated. An enhanced system would use a module (or JSON file) for the list and update functions - or even a class.

Note: In an enhanced system, a (random or uniform) time delay might be added between requests.
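
One possible way to add such a delay is sketched here, using a uniformly distributed random pause. The function name `polite_pause` and the default bounds are assumptions chosen for illustration.

```python
import random
import time


def polite_pause(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a uniformly random time between the two bounds.

    Returns the delay actually used, which can be handy for logging.
    """
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay
```

Calling `polite_pause()` between successive requests spreads them out so that the site is not hit in a rapid burst.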

10. Today
Today's date is used to name the saved feed file, as follows.

Here is the Python code [#3]

Here is the output of the Python code.

Note: Many RSS files are large and not easily read with a text editor such as Notepad++.

Luckily an RSS file has a fairly well-defined format and can be parsed as XML (Extensible Markup Language).
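
To illustrate the point, here is a small sketch that parses RSS text as XML using the standard library `xml.etree.ElementTree` module. The sample feed text and the function name `list_items` are made up for illustration; a real feed has many more elements.

```python
import xml.etree.ElementTree as ET

# Tiny made-up RSS 2.0 document standing in for a saved feed file.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First story</title><link>https://example.com/1</link></item>
    <item><title>Second story</title><link>https://example.com/2</link></item>
  </channel>
</rss>"""


def list_items(rss_text):
    """Return a list of (title, link) pairs, one per <item> element."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]


for title, link in list_items(SAMPLE_RSS):
    print(title, link)
```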

Note: The urlretrieve function (in urllib.request) combines the steps of opening the URL, reading the response, and saving the response body to a file as bytes.
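
A rough sketch of this save step follows, assuming a file naming scheme of feed name plus ISO date. The function names `feed_file_name` and `save_feed` and the naming scheme itself are assumptions for illustration, not the original program.

```python
import datetime
import pathlib
import urllib.request


def feed_file_name(feed_name, day=None):
    """Build a file name from the feed name and a date (today by default)."""
    day = day or datetime.date.today()
    return f"{feed_name}_{day.isoformat()}.xml"


def save_feed(feed_name, url, folder=".", day=None):
    """Fetch url once and save the raw bytes under folder; return the path."""
    target = pathlib.Path(folder) / feed_file_name(feed_name, day)
    # urlretrieve opens the URL, reads the response, and writes it to the file.
    urllib.request.urlretrieve(url, str(target))
    return target
```

A call such as `save_feed("cnn-top", url)` would then make one web access and leave a dated file behind for later processing.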

11. Processing the feeds
Next, a separate program is used to process the feeds.

A general program (omitted) might process all feeds for all dates, or just the dates that have not yet been processed (in the data flow model used).

In an enhanced system, folders might be used, one per RSS feed name, in which case the RSS feed name would not be needed in the file name, just the date.
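
As a sketch of that idea, suppose each feed has one folder of saved files named by date (for example `2020-03-23.xml`) and a second folder of processed results with matching names. The folder layout, the `.json` extension for results, and the function name `unprocessed_dates` are all assumptions for illustration.

```python
import pathlib


def unprocessed_dates(feed_folder, done_folder):
    """Return the dates saved in feed_folder that have no result in done_folder.

    Dates are taken from the file names (without extension), so a saved
    file 2020-03-23.xml is matched against a result 2020-03-23.json.
    """
    saved = {p.stem for p in pathlib.Path(feed_folder).glob("*.xml")}
    done = {p.stem for p in pathlib.Path(done_folder).glob("*.json")}
    return sorted(saved - done)
```

Because ISO dates sort correctly as text, the returned list is in date order.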

12. Feedparser
Here is the Python code [#4]

The feedparser package will be used.

One way to install the feedparser package is as follows.
D:\Python38\Scripts\pip3.exe install feedparser

Use the path to your installed version of Python.

Here is the output of the Python code.


13. Data scrubbing
It is fairly easy to get the RSS feeds working, updating, and saving.

It can be very time-consuming to figure out and handle the quirks in how each site uses its RSS feeds in order to get useful information.

14. Encoding issues
Character encoding problems are common.

Note: In this sample program, there were a number of character encoding issues from the CNN feed that were not resolved. Those are currently displayed as "(omitted - encoding issue)".

Many of these issues involve a site using an extended ASCII encoding (that is, a code page such as Windows-1252) rather than, say, UTF-8 encoding.
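
One possible way to handle such text is sketched below: try UTF-8 first and fall back to a code page if that fails. The choice of Windows-1252 (`cp1252`) as the fallback and the function name `decode_feed_bytes` are assumptions; the sample program's unresolved issues may have other causes.

```python
def decode_feed_bytes(raw, fallback="cp1252"):
    """Decode bytes as UTF-8, falling back to a code page on failure.

    Bytes that are undefined in the fallback code page are replaced
    with U+FFFD rather than raising an error.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace")


# The same word encoded two ways decodes to the same text.
print(decode_feed_bytes("café".encode("utf-8")))
print(decode_feed_bytes("café".encode("cp1252")))
```

Trying UTF-8 first is a reasonable order because text in a code page that contains accented characters is rarely valid UTF-8, so the fallback triggers only when it is needed.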
