Parsing from Sitemap

This section explains how to create posts for your website by using a sitemap.xml file as a feed.

The News Parser plugin now allows you to parse content directly from sitemap files (sitemap.xml). A sitemap provides a structured list of a website's URLs, which makes it an efficient way to discover and parse multiple posts from a target site.

To parse content from a sitemap, follow these steps:

  1. Navigate to the Sitemap Parsing Section: In your WordPress admin panel, go to the News-Parsing -> Parsing Sitemap menu. This will open the Sitemap parsing interface.

  2. Enter the Sitemap URL: In the provided search bar, enter the complete URL of the sitemap.xml file you wish to parse. For example: https://www.example.com/sitemap.xml.

    • Finding the Sitemap URL: If you are unsure of the exact sitemap URL for a website, a common practice is to check the robots.txt file located at the root of the website (e.g., https://www.example.com/robots.txt). This file often contains a line like Sitemap: https://www.example.com/sitemap.xml indicating the sitemap's location.

  3. Parse Sitemap: After entering the sitemap URL, click the "Parse sitemap" button to initiate parsing. The plugin will retrieve and process the sitemap file from the provided URL.

  4. Review Parsed Sitemap Data: Once the sitemap data is fetched and processed, a list of URLs extracted from the sitemap will be displayed on your screen. Each URL in this list represents a potential post to parse.

  5. Create a Parsing Template (Recommended for Autopilot & Efficiency): At this stage, you can open the Visual Constructor by clicking the Visual Constructor icon (usually located near the post list or at the bottom of the page). Use it to create a parsing template tailored to the structure of the posts listed in the sitemap. Saving a template is highly recommended if you plan to use the Autopilot feature for automatic parsing or to parse more posts from this sitemap source later, because a template keeps content extraction consistent and efficient.

  6. Manual Post Selection and Parsing (For Selective Parsing): Alternatively, if you only want to parse some of the posts from the sitemap, select the URLs you are interested in from the displayed list, then start manual parsing by clicking the corresponding button (labeled "Parse Selected Posts" or similar). Only the URLs you have chosen will be parsed.

  7. Save as Draft or Publish: After parsing, whether with a template or by manual selection, the extracted content for each post is presented. You can then review the parsed content, make any necessary edits, and either save the posts as drafts or publish them directly to your website.
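
Steps 2 to 4 can be illustrated with a short sketch. It is written in Python purely for illustration and is not the plugin's own code: the function names `sitemap_urls_from_robots` and `page_urls_from_sitemap` are hypothetical, and the sample robots.txt and sitemap bodies are made-up data using the example.com addresses from above. It assumes a standard sitemap that declares the sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def sitemap_urls_from_robots(robots_txt: str) -> list[str]:
    """Return every URL declared on a 'Sitemap:' line of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the colon in 'https://' is kept.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def page_urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Return the text of every namespaced <loc> element in a sitemap body."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.iter(f"{{{SITEMAP_NS}}}loc"):
        if loc.text:
            urls.append(loc.text.strip())
    return urls


# Made-up sample data standing in for files fetched from a target site.
ROBOTS = """User-agent: *
Disallow: /wp-admin/
Sitemap: https://www.example.com/sitemap.xml
"""

SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/post-1/</loc></url>
  <url><loc>https://www.example.com/post-2/</loc></url>
</urlset>"""

if __name__ == "__main__":
    print(sitemap_urls_from_robots(ROBOTS))
    print(page_urls_from_sitemap(SITEMAP))
```

Each URL returned by the second function corresponds to one entry in the list the plugin displays in step 4. A sitemap index file, which lists other sitemaps rather than posts, would yield sitemap URLs here instead of post URLs.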

Important Considerations when Parsing from Sitemaps:

  • Avoid Parsing Excessive Posts Simultaneously: Parsing a very large number of posts at once, especially from a sitemap containing thousands of URLs, can put a significant load on your server, because parsing runs on your own server. Requesting too many posts concurrently can also look like a Denial-of-Service (DoS) attack to the target website. It is strongly recommended to parse posts in smaller batches, or to use the Autopilot feature with appropriate scheduling and limits, so that neither server is overloaded.

  • Template Usage for Autopilot: If you intend to use the Autopilot feature to automatically parse posts from the sitemap source on a recurring schedule, creating and saving a parsing template is essential. The Autopilot will use this template to consistently extract content from new posts discovered in the sitemap over time.
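
The batching advice above can be sketched as follows. This is a minimal Python illustration of the general idea, not how the plugin or its Autopilot feature schedules work: `batched`, `parse_in_batches`, the `parse_one` callback, and the default batch size and pause are all hypothetical choices made for the example.

```python
import time
from typing import Callable, Iterator, Sequence


def batched(urls: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `urls` holding at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def parse_in_batches(
    urls: Sequence[str],
    parse_one: Callable[[str], str],
    batch_size: int = 5,
    pause_seconds: float = 2.0,
    sleep: Callable[[float], object] = time.sleep,
) -> list[str]:
    """Parse URLs one batch at a time, pausing between batches.

    `parse_one` is a placeholder for whatever fetches and parses a single
    post; `sleep` is injectable so the pause can be replaced in tests.
    """
    results: list[str] = []
    batches = list(batched(urls, batch_size))
    for index, batch in enumerate(batches):
        results.extend(parse_one(url) for url in batch)
        # Pause after every batch except the last one.
        if index < len(batches) - 1:
            sleep(pause_seconds)
    return results


if __name__ == "__main__":
    sample = [f"https://www.example.com/post-{n}/" for n in range(1, 8)]
    # str.upper stands in for real parsing; no pause is used in this demo.
    for line in parse_in_batches(sample, str.upper, batch_size=3, pause_seconds=0):
        print(line)
```

Smaller batches and longer pauses trade total run time for a lighter, steadier load on both your server and the target website.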

By following these steps, you can use the Sitemap parsing feature of the News Parser plugin to discover and import content from websites that publish a sitemap.xml file. Use templates for consistent parsing, and keep server load in mind when parsing large numbers of posts.
