gns

google-news-scraper

Lightweight scraper for Google News

Showing:

Popularity

Downloads/wk

36

GitHub Stars

45

Maintenance

Last Commit

14d ago

Contributors

5

Package

Dependencies

3

License

MIT

Type Definitions

Tree-Shakeable

No?

Categories

Readme

Google News Scraper

A lightweight package that scrapes article data from Google News. Simply pass a keyword or phrase, and the results are returned as an array of JSON objects.

Installation

# Install via NPM
npm install google-news-scraper

# Install via Yarn
yarn add google-news-scraper

Usage

// Require the package
const googleNewsScraper = require('google-news-scraper')

// Execute within an async function, pass a config object (further documentation below)
const articles = await googleNewsScraper({
    searchTerm: "The Oscars",
    prettyURLs: false,
    queryVars: {
        hl:"en-US",
        gl:"US",
        ceid:"US:en"
      },
    timeframe: "5d",
    puppeteerArgs: []
})

Config

The config object passed to the function above has the following properties:

Search Term (required)

This is the search query you'd like to find articles for.

Query Vars (optional)

Additional query params to add to the URL.

Pretty URLs (required)

The URLs that Google News supplies for each article are "ugly" redirect links (eg: "https://news.google.com/articles/CAIiEPgfWP_e7PfrSwLwvWeb5msqFwgEKg8IACoHCAowjuuKAzCWrzwwt4QY?hl=en-GB&gl=GB&ceid=GB%3Aen").

You can optionally ask the scraper to follow the redirect and retrieve the actual "pretty" URL (eg: "https://www.nytimes.com/2020/01/22/movies/expanded-best-picture-oscar.html").

As you can imagine, this results in lots of additional HTTP requests, which negatively impact the scraper's performance. In testing, following redirects took around five times longer on average.

Timeframe (optional)

The results can be filtered to articles published within a given timeframe prior to the requesst. The default is 7 days.

Puppeteer Arguments (optional)

An array of Chromium flags to pass to the browser instance. By default, this will be an empty array. A full list of available flags can be found here. NB: if you are launching this in a Heroku app, you will need to pass the --no-sandbox and --disable-setuid-sandbox flags, as explained in this SO answer.

The format of the timeframe is a string comprised of a number, followed by a letter prepresenting the time operator. For example 1y would signify 1 year. Full list of operators below:

  • h = hours (eg: 12h)
  • d = days (eg: 7d)
  • m = months (eg: 6m)
  • y = years (eg: 1y)

Output

The output is an array of JSON objects, with each article following the structure below:

[
    {
        "title":  "Article title",
        "subtitle":  "Article subtitle",
        "link":  "http://url-to-website.com/path/to/article",
        "image":"http://url-to-website.com/path/to/image.jpg",
        "source":  "Name of publication",
        "time":  "Time/date published (human-readable)"
    }
]

Performance

My test query returned 104 results, which took 1.566 seconds without redirects, and 7.36 seconds with redirects. I'm on a fibre connection, and other queries may return a different number of results, so your mileage may vary.

Upkeep

Please note that this is a web-scraper, which relies on DOM selectors, so any fundamental changes in the markup on the Google News site will probably break this tool. I'll try my best to keep it up-to-date, but changes to the markup on Google News will be silent and therefore difficult to keep track of. Feel free to submit an issue if the tool stops working.

Issues

Please report bugs via the issue tracker.

Contribute

Feel free to submit a PR if you've fixed an open issue. Thank you.

Rate & Review

Great Documentation0
Easy to Use0
Performant0
Highly Customizable0
Bleeding Edge0
Responsive Maintainers0
Poor Documentation0
Hard to Use0
Slow0
Buggy0
Abandoned0
Unwelcoming Community0
100
No reviews found
Be the first to rate

Alternatives

No alternatives found

Tutorials

No tutorials found
Add a tutorial