Provided by: python-podcastparser-doc_0.6.10-2_all
NAME
podcastparser - podcastparser Documentation podcastparser is a simple and fast podcast feed parser library in Python. The two primary users of the library are the gPodder Podcast Client and the gpodder.net web service. The following feed types are supported: • Really Simple Syndication (RSS 2.0) • Atom Syndication Format (RFC 4287) The following specifications are supported: • Paged Feeds (RFC 5005) • Podlove Simple Chapters • Podcast Index Podcast Namespace These formats only specify the possible markup elements and attributes. We recommend that you also read the Podcast Feed Best Practice guide if you want to optimize your feeds for best display in podcast clients. Where times and durations are used, the values are expected to be formatted either as seconds or as RFC 2326 Normal Play Time (NPT). import podcastparser import urllib.request feedurl = 'http://example.com/feed.xml' parsed = podcastparser.parse(feedurl, urllib.request.urlopen(feedurl)) # parsed is a dict import pprint pprint.pprint(parsed) For both RSS and Atom feeds, only a subset of elements (those that are relevant to podcast client applications) is parsed. This section describes which elements and attributes are parsed and how the contents are interpreted/used.
RSS
rss@xml:base Base URL for all relative links in the RSS file. rss/channel Podcast. rss/channel/title Podcast title (whitespace is squashed). rss/channel/link Podcast website. rss/channel/description Podcast description (whitespace is squashed). rss/channel/image/url Podcast cover art. rss/channel/itunes:image Podcast cover art (alternative). rss/channel/itunes:type Podcast type (whitespace is squashed). One of ‘episodic’ or ‘serial’. rss/channel/itunes:keywords Podcast keywords (whitespace is squashed). rss/channel/atom:link@rel=payment Podcast payment URL (e.g. Flattr). rss/channel/generator A string indicating the program used to generate the channel. (e.g. MightyInHouse Content System v2.3). rss/channel/language Podcast language. rss/channel/itunes:author The group responsible for creating the show. rss/channel/itunes:owner The podcast owner contact information. The <itunes:owner> tag information is for administrative communication about the podcast and isn’t displayed in Apple Podcasts rss/channel/itunes:category The show category information. rss/channel/itunes:explicit Indicates whether podcast contains explicit material. rss/channel/itunes:new-feed-url The new podcast RSS Feed URL. rss/channel/podcast:locked If the podcast is currently locked from being transferred. rss/channel/podcast:funding Funding link for podcast. rss/redirect/newLocation The new podcast RSS Feed URL. rss/channel/item Episode. rss/channel/item/guid Episode unique identifier (GUID), mandatory. rss/channel/item/title Episode title (whitespace is squashed). rss/channel/item/link Episode website. rss/channel/item/description Episode description. If it contains html, it’s returned as description_html. Otherwise it’s returned as description (whitespace is squashed). See Mozilla’s article Why RSS Content Module is Popular rss/channel/item/itunes:summary Episode description (whitespace is squashed). rss/channel/item/itunes:subtitle Episode subtitled / one-line description (whitespace is squashed). rss/channel/item/content:encoded Episode description in HTML. Best source for description_html. rss/channel/item/itunes:duration Episode duration. rss/channel/item/pubDate Episode publication date. rss/channel/item/atom:link@rel=payment Episode payment URL (e.g. Flattr). rss/channel/item/atom:link@rel=enclosure File download URL (@href), size (@length) and mime type (@type). rss/channel/item/itunes:image Episode art URL. rss/channel/item/media:thumbnail Episode art URL. rss/channel/item/media:group/media:thumbnail Episode art URL. rss/channel/item/media:content File download URL (@url), size (@fileSize) and mime type (@type). rss/channel/item/media:group/media:content File download URL (@url), size (@fileSize) and mime type (@type). rss/channel/item/enclosure File download URL (@url), size (@length) and mime type (@type). rss/channel/item/psc:chapters Podlove Simple Chapters, version 1.1 and 1.2. rss/channel/item/psc:chapters/psc:chapter Chapter entry (@start, @title, @href and @image). rss/channel/item/itunes:explicit Indicates whether episode contains explicit material. rss/channel/item/itunes:author The group responsible for creating the episode. rss/channel/item/itunes:season The season number of the episode. rss/channel/item/itunes:episode An episode number. rss/channel/item/itunes:episodeType The episode type. This flag is used if an episode is a trailer or bonus content. rss/channel/item/podcast:chapters The url to a JSON file describing the chapters. Only the url is added to the data as fetching an external URL would be unsafe. rss/channel/item/podcast:person A person involved in the episode, e.g. host, or guest. rss/channel/item/podcast:transcript The url for the transcript file associated with this episode.
ATOM
For Atom feeds, podcastparser will handle the following elements and attributes: atom:feed Podcast. atom:feed/atom:title Podcast title (whitespace is squashed). atom:feed/atom:subtitle Podcast description (whitespace is squashed). atom:feed/atom:icon Podcast cover art. atom:feed/atom:link@href Podcast website. atom:feed/atom:entry Episode. atom:feed/atom:entry/atom:id Episode unique identifier (GUID), mandatory. atom:feed/atom:entry/atom:title Episode title (whitespace is squashed). atom:feed/atom:entry/atom:link@rel=enclosure File download URL (@href), size (@length) and mime type (@type). atom:feed/atom:entry/atom:link@rel=(self|alternate) Episode website. atom:feed/atom:entry/atom:link@rel=payment Episode payment URL (e.g. Flattr). atom:feed/atom:entry/atom:content Episode description (in HTML or plaintext). atom:feed/atom:entry/atom:published Episode publication date. atom:feed/atom:entry/media:thumbnail Episode art URL. atom:feed/atom:entry/media:group/media:thumbnail Episode art URL. atom:feed/atom:entry/psc:chapters Podlove Simple Chapters, version 1.1 and 1.2. atom:feed/atom:entry/psc:chapters/psc:chapter Chapter entry (@start, @title, @href and @image). Simplified, fast RSS parser exception podcastparser.FeedParseError(msg, exception, locator) Exception raised when asked to parse an invalid feed This exception allows users of this library to catch exceptions without having to import the XML parsing library themselves. class podcastparser.PodcastHandler(url, max_episodes) characters(chars) Receive notification of character data. The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information. endElement(name) Signals the end of an element in non-namespace mode. The name parameter contains the name of the element type, just as with the startElement event. startElement(name, attrs) Signals the start of an element in non-namespace mode. The name parameter contains the raw XML 1.0 name of the element type as a string and the attrs parameter holds an instance of the Attributes class containing the attributes of the element. class podcastparser.RSSItemDescription RSS 2.0 almost encourages to put html content in item/description but content:encoded is the better source of html content and itunes:summary is known to contain the short textual description of the item. So use a heuristic to attribute text to either description or description_html, without overriding existing values. podcastparser.file_basename_no_extension(filename) Returns filename without extension >>> file_basename_no_extension('/home/me/file.txt') 'file' >>> file_basename_no_extension('file') 'file' podcastparser.is_html(text) Heuristically tell if text is HTML By looking for an open tag (more or less:) >>> is_html(‘<h1>HELLO</h1>’) True >>> is_html(‘a < b < c’) False podcastparser.normalize_feed_url(url) Normalize and convert a URL. If the URL cannot be converted (invalid or unknown scheme), None is returned. This will also normalize feed:// and itpc:// to http://. >>> normalize_feed_url('itpc://example.org/podcast.rss') 'http://example.org/podcast.rss' If no URL scheme is defined (e.g. “curry.com”), we will simply assume the user intends to add a http:// feed. >>> normalize_feed_url('curry.com') 'http://curry.com/' It will also take care of converting the domain name to all-lowercase (because domains are not case sensitive): >>> normalize_feed_url('http://Example.COM/') 'http://example.com/' Some other minimalistic changes are also taken care of, e.g. a ? with an empty query is removed: >>> normalize_feed_url('http://example.org/test?') 'http://example.org/test' Leading and trailing whitespace is removed >>> normalize_feed_url(' http://example.com/podcast.rss ') 'http://example.com/podcast.rss' Incomplete (too short) URLs are not accepted >>> normalize_feed_url('http://') is None True Unknown protocols are not accepted >>> normalize_feed_url('gopher://gopher.hprc.utoronto.ca/file.txt') is None True podcastparser.parse(url, stream, max_episodes=0) Parse a podcast feed from the given URL and stream Parameters • url – the URL of the feed. Will be used to resolve relative links • stream – file-like object containing the feed content • max_episodes – maximum number of episodes to return. 0 (default) means no limit Returns a dict with the parsed contents of the feed podcastparser.parse_length(text) Parses a file length >>> parse_length(None) -1 >>> parse_length('0') -1 >>> parse_length('unknown') -1 >>> parse_length('100') 100 podcastparser.parse_pubdate(text) Parse a date string into a Unix timestamp >>> parse_pubdate('Fri, 21 Nov 1997 09:55:06 -0600') 880127706 >>> parse_pubdate('2003-12-13T00:00:00+02:00') 1071266400 >>> parse_pubdate('2003-12-13T18:30:02Z') 1071340202 >>> parse_pubdate('Mon, 02 May 1960 09:05:01 +0100') -305049299 >>> parse_pubdate('') 0 >>> parse_pubdate('unknown') 0 podcastparser.parse_time(value) Parse a time string into seconds See RFC2326, 3.6 “Normal Play Time” (HH:MM:SS.FRACT) >>> parse_time('0') 0 >>> parse_time('128') 128 >>> parse_time('00:00') 0 >>> parse_time('00:00:00') 0 >>> parse_time('00:20') 20 >>> parse_time('00:00:20') 20 >>> parse_time('01:00:00') 3600 >>> parse_time(' 03:02:01') 10921 >>> parse_time('61:08') 3668 >>> parse_time('25:03:30 ') 90210 >>> parse_time('25:3:30') 90210 >>> parse_time('61.08') 61 >>> parse_time('01:02:03.500') 3723 >>> parse_time(' ') 0 podcastparser.parse_type(text) “normalize” a mime type >>> parse_type('text/plain') 'text/plain' >>> parse_type('text') 'application/octet-stream' >>> parse_type('') 'application/octet-stream' >>> parse_type(None) 'application/octet-stream' podcastparser.remove_html_tags(html) Remove HTML tags from a string and replace numeric and named entities with the corresponding character, so the HTML text can be displayed in a simple text view. podcastparser.squash_whitespace(text) Combine multiple whitespaces into one, trim trailing/leading spaces >>> squash_whitespace(' some text with a lot of spaces ') 'some text with a lot of spaces' podcastparser.squash_whitespace_not_nl(text) Like squash_whitespace, but don’t squash linefeeds and carriage returns >>> squash_whitespace_not_nl(' linefeeds\ncarriage\r returns') 'linefeeds\ncarriage\r returns' This is a list of podcast-related XML namespaces that are not yet supported by podcastparser, but might be in the future.
CHAPTER MARKS
• rawvoice RSS: Rating, Frequency, Poster, WebM, MP4, Metamark (kind of chapter-like markers) • IGOR: Chapter Marks
OTHERS
• libSYN RSS Extensions: contactPhone, contactEmail, contactTwitter, contactWebsite, wallpaper, pdf, background • Comment API: Comments to a given item (readable via RSS) • MVCB: Error Reports To Field (usually a mailto: link) • Syndication Module: Update period, frequency and base (for skipping updates) • Creative Commons RSS: Creative commons license for the content • Pheedo: Original link to website and original link to enclosure (without going through pheedo redirect) • WGS84: Geo-Coordinates per item • Conversations Network: Intro duration in milliseconds (for skipping the intro), ratings • purl DC Elements: dc:creator (author / creator of the podcast, possibly with e-mail address) • Tristana: tristana:self (canonical URL to feed) • Blip: Show name, show page, picture, username, language, rating, thumbnail_src, license • Index • Module Index • Search Page
AUTHOR
gPodder Team
COPYRIGHT
2023, gPodder Team