Purpose
Scrape information from a specified URL and write the result to a text-type attribute.
NOTE: This task cannot be used to scrape LinkedIn. For details, see the following link: Crawlers are prohibited by LinkedIn
Category Location: All, Data Services/Enrichment
Field description
- Scrape this URL: The URL to scrape. You can specify it by (a) typing the full URL directly, (b) specifying an attribute that contains the full URL, or (c) building the URL from a combination of text and attributes. To select an attribute as the URL, or to include one as a parameter in the URL, type the “@” sign and then select the attribute from the list that appears. A correctly selected attribute appears in the following format: {{attribute_name}}.
- Result Attribute: The attribute that the scraped data will be written to. This attribute must be of type Text, single value (any commas are treated as part of the text), and the scraped data will be stored as a single block of text.
- Add attribute: Select this to create a new attribute to hold the results.
Example of A (directly typing in the full URL)
Example of B (specifying an attribute that contains the full URL) For each record, the URL stored in the specified attribute (e.g., {{Complete URL}}) is what will be scraped during task execution.
Example of C (creating the URL using a combination of text and attributes) For each record, the attribute values will be populated and the resulting URL will be scraped. For example, if a given record has Specialty=“Pediatrics”, City=“San Francisco” and State=“CA”, the URL that will be scraped for that record will be: “https://www.vitals.com/search?query=Pediatrics&city_state=San Francisco, CA”.
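To make the substitution in Example C concrete, here is a minimal Python sketch of how {{attribute_name}} placeholders could be filled in from a record. It is an illustration only, not Openprise's implementation: the function name render_url, the template string, and the dictionary-shaped record are all assumptions. The sketch inserts values as-is, matching the URL shown in the example above, and does no URL encoding.

```python
import re

def render_url(template: str, record: dict) -> str:
    """Replace each {{attribute_name}} placeholder with the record's value.

    Hypothetical helper for illustration; values are inserted verbatim.
    """
    def substitute(match):
        name = match.group(1).strip()
        return str(record[name])  # raises KeyError if the attribute is missing

    return re.sub(r"\{\{(.*?)\}\}", substitute, template)

# Assumed template: text combined with three attributes, as in Example C.
template = (
    "https://www.vitals.com/search"
    "?query={{Specialty}}&city_state={{City}}, {{State}}"
)
record = {"Specialty": "Pediatrics", "City": "San Francisco", "State": "CA"}

url = render_url(template, record)
print(url)
```

The same helper covers Example B: a template consisting only of {{Complete URL}} resolves to whatever URL that attribute holds for the record.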
After you have scraped a URL, you will need to use the Ask AI template to format the results per your requirements.
An example of results is shown below:
Limitations:
- Certain URLs are NOT permitted to be scraped using this task! A list of the URLs that Openprise will not scrape can be found below. If you attempt to scrape a forbidden URL, you will see an error in the record status fields indicating that we do not support scraping it. This list will be updated periodically to reflect any additional URLs that we will not support scraping.
- Output returned from the web scraping action may be truncated so that the resulting scraped data can be stored in Openprise. If the returned data is truncated, you will see the text appended with the label “[TRUNCATED]”.
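If you post-process scraped text outside Openprise, you may want to know whether a value was cut short. The Python sketch below checks for the “[TRUNCATED]” label described above. It assumes the label sits at the very end of the text, and the function name split_truncation is hypothetical.

```python
from typing import Tuple

TRUNCATION_LABEL = "[TRUNCATED]"

def split_truncation(text: str) -> Tuple[str, bool]:
    """Return (text without the label, True) if the text ends with the label.

    Assumes the label is appended at the end of the scraped text.
    """
    stripped = text.rstrip()
    if stripped.endswith(TRUNCATION_LABEL):
        body = stripped[: -len(TRUNCATION_LABEL)].rstrip()
        return body, True
    return text, False

body, was_truncated = split_truncation("some page text [TRUNCATED]")
print(was_truncated)
```

A truncated value is incomplete by definition, so any downstream formatting step should treat it as partial data.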
Output Attributes:
These attributes give you additional context about the scraping status of a particular URL.
- web_scraper_status: Status of the scraping action, that is, whether or not the specified URL could be scraped.
- web_scraper_error: Text of the error message, if an issue was encountered during the scraping process. For example, if you attempt to scrape a forbidden URL, it will be indicated in this field.
- web_scraper_url_scraped: The complete URL you specified to be scraped in the task configuration. If you specify attributes as a part of the URL, the values for those attributes will be populated here so you can see the complete URL that corresponds to the scraped data.
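As a rough illustration of how these output attributes can be used together, the Python sketch below lists the URLs that failed, along with their error text. It assumes the records have been exported as dictionaries keyed by attribute name. The function name failed_scrapes and the sample error message are hypothetical, and the sketch relies only on web_scraper_error being empty when no issue occurred.

```python
def failed_scrapes(records):
    """Return (url, error) pairs for records with a non-empty error message."""
    return [
        (r.get("web_scraper_url_scraped", ""), r["web_scraper_error"])
        for r in records
        if (r.get("web_scraper_error") or "").strip()
    ]

# Hypothetical exported records; the error text is made up for illustration.
records = [
    {"web_scraper_url_scraped": "https://example.com/a", "web_scraper_error": ""},
    {"web_scraper_url_scraped": "https://example.com/b",
     "web_scraper_error": "scraping this URL is not supported"},
]

for url, error in failed_scrapes(records):
    print(url, "->", error)
```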