ultimate-sitemap-parser

Ultimate Website Sitemap Parser

License: GPLv3+

.. image:: https://travis-ci.org/mediacloud/ultimate-sitemap-parser.svg?branch=develop
    :target: https://travis-ci.org/mediacloud/ultimate-sitemap-parser
    :alt: Build Status

.. image:: https://readthedocs.org/projects/ultimate-sitemap-parser/badge/?version=latest
    :target: https://ultimate-sitemap-parser.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status

.. image:: https://coveralls.io/repos/github/mediacloud/ultimate-sitemap-parser/badge.svg?branch=develop
    :target: https://coveralls.io/github/mediacloud/ultimate-sitemap-parser?branch=develop
    :alt: Coverage Status

.. image:: https://badge.fury.io/py/ultimate-sitemap-parser.svg
    :target: https://badge.fury.io/py/ultimate-sitemap-parser
    :alt: PyPI package

.. image:: https://pepy.tech/badge/ultimate-sitemap-parser
    :target: https://pepy.tech/project/ultimate-sitemap-parser
    :alt: Download stats

Website sitemap parser for Python 3.5+.

Features

  • Supports all sitemap formats:

    • `XML sitemaps <https://www.sitemaps.org/protocol.html#xmlTagDefinitions>`_
    • `Google News sitemaps <https://support.google.com/news/publisher-center/answer/74288?hl=en>`_
    • `plain text sitemaps <https://www.sitemaps.org/protocol.html#otherformats>`_
    • `RSS 2.0 / Atom 0.3 / Atom 1.0 sitemaps <https://www.sitemaps.org/protocol.html#otherformats>`_
    • `Sitemaps linked from robots.txt <https://developers.google.com/search/reference/robots_txt#sitemap>`_

  • Field-tested with ~1 million URLs as part of the `Media Cloud project <https://mediacloud.org/>`_

  • Tolerates the more common sitemap bugs

  • Tries to find sitemaps not listed in ``robots.txt``

  • Uses fast and memory-efficient Expat XML parsing

  • Doesn't consume much memory even with massive sitemap hierarchies

  • Provides the generated sitemap tree as an easy-to-use object tree

  • Supports using a custom web client

  • Uses a small number of actively maintained third-party modules

  • Reasonably tested

Installation

.. code:: sh

    pip install ultimate-sitemap-parser

Usage

.. code:: python

    from usp.tree import sitemap_tree_for_homepage

    tree = sitemap_tree_for_homepage('https://www.nytimes.com/')
    print(tree)

``sitemap_tree_for_homepage()`` returns a tree of ``AbstractSitemap`` subclass objects that represent the sitemap hierarchy found on the website; see `a reference of AbstractSitemap subclasses <https://ultimate-sitemap-parser.readthedocs.io/en/latest/usp.objects.html#module-usp.objects.sitemap>`_.
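
To get a feel for the hierarchy, the tree can be walked recursively. The following is a minimal sketch, assuming that every sitemap object has a ``url`` attribute and that index-type sitemaps additionally expose their children as ``sub_sitemaps`` (check the linked reference for the exact attributes). ``sitemap_outline()`` and ``StubSitemap`` are hypothetical names introduced here for illustration and are not part of the library; the stub lets the sketch run without network access.

.. code:: python

    def sitemap_outline(sitemap, depth=0):
        """Return indented lines describing a sitemap and its children."""
        lines = ['  ' * depth + sitemap.url]
        # Assumption: only index sitemaps have sub_sitemaps; others yield [].
        for child in getattr(sitemap, 'sub_sitemaps', []):
            lines.extend(sitemap_outline(child, depth + 1))
        return lines


    class StubSitemap:
        """Hypothetical stand-in for a sitemap object (illustration only)."""

        def __init__(self, url, sub_sitemaps=()):
            self.url = url
            self.sub_sitemaps = list(sub_sitemaps)


    stub_tree = StubSitemap('https://example.com/', [
        StubSitemap('https://example.com/robots.txt', [
            StubSitemap('https://example.com/sitemap.xml'),
        ]),
    ])
    outline = sitemap_outline(stub_tree)
    print('\n'.join(outline))

With a real tree, the same helper would be called as ``sitemap_outline(tree)``.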

If you'd just like to list all the pages found in all of the sitemaps within the website, consider using the ``all_pages()`` method:

.. code:: python

    # all_pages() returns an Iterator
    for page in tree.all_pages():
        print(page)

The ``all_pages()`` method returns an iterator yielding ``SitemapPage`` objects; see `a reference of SitemapPage <https://ultimate-sitemap-parser.readthedocs.io/en/latest/usp.objects.html#module-usp.objects.page>`_.
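
Because the result is an iterator, it can be consumed lazily. If the same page happens to be listed in more than one sitemap, you may want to deduplicate by URL. The following is a minimal sketch, assuming each page object has a ``url`` attribute as described in the linked reference. ``unique_page_urls()`` and ``StubPage`` are hypothetical names used only for illustration.

.. code:: python

    def unique_page_urls(pages):
        """Return page URLs in first-seen order, skipping duplicates."""
        seen = set()
        urls = []
        for page in pages:
            if page.url not in seen:
                seen.add(page.url)
                urls.append(page.url)
        return urls


    class StubPage:
        """Hypothetical stand-in for a page object (illustration only)."""

        def __init__(self, url):
            self.url = url


    stub_pages = iter([
        StubPage('https://example.com/a'),
        StubPage('https://example.com/b'),
        StubPage('https://example.com/a'),
    ])
    urls = unique_page_urls(stub_pages)
    print(urls)

With a real tree, the call would be ``unique_page_urls(tree.all_pages())``. Note that the helper keeps every distinct URL in memory, so for very large sites you may prefer to process pages one at a time instead.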
