Mirror of https://github.com/Fishwaldo/huginn.git (synced 2025-03-16 03:41:41 +00:00)

Merge pull request #1205 from cantino/event_does_not_require_url

When the WebsiteAgent receives Events, we no longer require that they contain a `url` key.

Commit 5e95c7149a. 4 changed files with 62 additions and 27 deletions.
app/models/agents/website_agent.rb

@@ -16,16 +16,15 @@ module Agents
     description <<-MD
       The Website Agent scrapes a website, XML document, or JSON feed and creates Events based on the results.

-      Specify a `url` and select a `mode` for when to create Events based on the scraped data, either `all` or `on_change`.
+      Specify a `url` and select a `mode` for when to create Events based on the scraped data, either `all`, `on_change`, or `merge` (if fetching based on an Event, see below).

-      `url` can be a single url, or an array of urls (for example, for multiple pages with the exact same structure but different content to scrape)
+      The `url` option can be a single url, or an array of urls (for example, for multiple pages with the exact same structure but different content to scrape).

       The WebsiteAgent can also scrape based on incoming events.

-      * If the Event contains a `url` key, that URL will be fetched.
-      * For more control, you can set the `url_from_event` option and it will be used as a Liquid template to generate the url to access based on the Event.
-      * If you set `data_from_event` to a Liquid template, it will be used to generate the data directly without fetching any URL. (For example, set it to `{{ html }}` to use HTML contained in the `html` key of the incoming Event.)
-      * If you specify `merge` for the `mode` option, Huginn will retain the old payload and update it with the new values.
+      * Set the `url_from_event` option to a Liquid template to generate the url to access based on the Event. (To fetch the url in the Event's `url` key, for example, set `url_from_event` to `{{ url }}`.)
+      * Alternatively, set `data_from_event` to a Liquid template to use data directly without fetching any URL. (For example, set it to `{{ html }}` to use HTML contained in the `html` key of the incoming Event.)
+      * If you specify `merge` for the `mode` option, Huginn will retain the old payload and update it with new values.

       # Supported Document Types
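The practical upshot of the description change: an Event's `url` key is no longer fetched implicitly; you opt in with `url_from_event`. A minimal sketch of options for such an agent, written as a Ruby hash as in the specs below (the `extract` selector is illustrative, not from this commit):

    # Sketch: a WebsiteAgent that builds its URL from the incoming Event
    # via a Liquid template, and merges new values into the old payload.
    options = {
      'url_from_event' => 'http://example.org/?url={{ url | uri_escape }}',
      'type'           => 'html',
      'mode'           => 'merge',
      'extract'        => {
        'title' => { 'css' => 'title', 'value' => './/text()' }  # illustrative
      }
    }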
@@ -37,7 +36,7 @@ module Agents

       # Scraping HTML and XML

-      When parsing HTML or XML, these sub-hashes specify how each extraction should be done. The Agent first selects a node set from the document for each extraction key by evaluating either a CSS selector in `css` or an XPath expression in `xpath`. It then evaluates an XPath expression in `value` (default: `.`) on each node in the node set, converting the result into string. Here's an example:
+      When parsing HTML or XML, these sub-hashes specify how each extraction should be done. The Agent first selects a node set from the document for each extraction key by evaluating either a CSS selector in `css` or an XPath expression in `xpath`. It then evaluates an XPath expression in `value` (default: `.`) on each node in the node set, converting the result into a string. Here's an example:

          "extract": {
            "url": { "css": "#comic img", "value": "@src" },
@@ -45,11 +44,11 @@ module Agents
            "body_text": { "css": "div.main", "value": ".//text()" }
          }

-      "@_attr_" is the XPath expression to extract the value of an attribute named _attr_ from a node, and ".//text()" is to extract all the enclosed texts. To extract the innerHTML, use "./node()"; and to extract the outer HTML, use ".".
+      "@_attr_" is the XPath expression to extract the value of an attribute named _attr_ from a node, and `.//text()` extracts all the enclosed text. To extract the innerHTML, use `./node()`; and to extract the outer HTML, use `.`.

-      You can also use [XPath functions](http://www.w3.org/TR/xpath/#section-String-Functions) like `normalize-space` to strip and squeeze whitespace, `substring-after` to extract part of a text, and `translate` to remove comma from a formatted number, etc. Note that these functions take a string, not a node set, so what you may think would be written as `normalize-space(.//text())` should actually be `normalize-space(.)`.
+      You can also use [XPath functions](http://www.w3.org/TR/xpath/#section-String-Functions) like `normalize-space` to strip and squeeze whitespace, `substring-after` to extract part of a text, and `translate` to remove commas from formatted numbers, etc. Note that these functions take a string, not a node set, so what you may think would be written as `normalize-space(.//text())` should actually be `normalize-space(.)`.

-      Beware that when parsing an XML document (i.e. `type` is `xml`) using `xpath` expressions all namespaces are stripped from the document unless a toplevel option `use_namespaces` is set to true.
+      Beware that when parsing an XML document (i.e. `type` is `xml`) using `xpath` expressions, all namespaces are stripped from the document unless the top-level option `use_namespaces` is set to `true`.

       # Scraping JSON
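To make the `value` note concrete, here is a hypothetical `extract` hash applying those string functions (the selectors and key names are assumptions for illustration):

    # Sketch: `value` is evaluated on each node the selector picked out.
    # XPath string functions operate on the node itself (`.`), not a node set.
    options['extract'] = {
      'price' => {
        'css'   => 'span.price',
        'value' => 'normalize-space(.)'               # strip and squeeze whitespace
      },
      'sku' => {
        'xpath' => '//div[@class="product"]',
        'value' => 'substring-after(@id, "product-")' # tail of the id attribute
      }
    }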
@@ -92,7 +91,7 @@ module Agents

       Set `uniqueness_look_back` to limit the number of events checked for uniqueness (typically for performance). This defaults to the larger of #{UNIQUENESS_LOOK_BACK} or #{UNIQUENESS_FACTOR}x the number of detected received results.

-      Set `force_encoding` to an encoding name if the website is known to respond with a missing, invalid or wrong charset in the Content-Type header. Note that a text content without a charset is taken as encoded in UTF-8 (not ISO-8859-1).
+      Set `force_encoding` to an encoding name if the website is known to respond with a missing, invalid, or wrong charset in the Content-Type header. Note that a text content without a charset is taken as encoded in UTF-8 (not ISO-8859-1).

       Set `user_agent` to a custom User-Agent name if the website does not like the default value (`#{default_user_agent}`).
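These fetch-tuning options sit alongside the scraping options; a hedged example with illustrative values:

    # Sketch: request-level knobs for a WebsiteAgent (values are examples only).
    options.merge!(
      'force_encoding'       => 'EUC-JP',          # site sends a wrong or missing charset
      'user_agent'           => 'MyHuginnBot/1.0', # hypothetical custom User-Agent
      'uniqueness_look_back' => 500                # cap events checked by on_change
    )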
@@ -343,7 +342,7 @@ module Agents
          if url_template = options['url_from_event'].presence
            interpolate_options(url_template)
          else
-           event.payload['url']
+           interpolated['url']
          end
        check_urls(url_to_scrape, existing_payload)
      end
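Annotated, the resolution order this hunk establishes (a condensed sketch of the surrounding code, with the comments added here):

    # How a received Event's target URL is now chosen:
    url_to_scrape =
      if url_template = options['url_from_event'].presence
        interpolate_options(url_template)  # Liquid template, rendered against the Event
      else
        interpolated['url']                # fall back to the agent's own `url` option
      end                                  # (previously: event.payload['url'])
    check_urls(url_to_scrape, existing_payload)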
db/migrate/*_website_agent_does_not_use_event_url.rb (new file)

@@ -0,0 +1,22 @@
+class WebsiteAgentDoesNotUseEventUrl < ActiveRecord::Migration
+  def up
+    # Until this migration, if a WebsiteAgent received Events and did not have a `url_from_event` option set,
+    # it would use the `url` from the Event's payload. If the Event did not have a `url` in its payload, the
+    # WebsiteAgent would do nothing. This migration assumes that if someone has wired a WebsiteAgent to receive Events
+    # and has not set `url_from_event` or `data_from_event`, they were trying to use the Event's `url` payload, so we
+    # set `url_from_event` to `{{ url }}` for them.
+    Agents::WebsiteAgent.find_each do |agent|
+      next unless agent.sources.count > 0
+
+      if !agent.options['data_from_event'].present? && !agent.options['url_from_event'].present?
+        agent.options['url_from_event'] = '{{ url }}'
+        agent.save!
+        puts ">> Setting `url_from_event` on WebsiteAgent##{agent.id} to {{ url }} because it is wired"
+        puts ">> to receive Events, and the WebsiteAgent no longer uses the Event's `url` value directly."
+      end
+    end
+  end
+
+  def down
+  end
+end
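To preview which agents this migration will touch, a read-only variant of the same loop can be run in a Rails console first (a sketch, not part of this commit):

    # List WebsiteAgents that receive Events but set neither option;
    # these are exactly the ones the migration will update.
    Agents::WebsiteAgent.find_each do |agent|
      next unless agent.sources.count > 0
      if agent.options['url_from_event'].blank? && agent.options['data_from_event'].blank?
        puts "WebsiteAgent##{agent.id} would get url_from_event = '{{ url }}'"
      end
    end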
spec/controllers/agents_controller_spec.rb

@@ -397,6 +397,8 @@ describe AgentsController do
    it "accepts an event" do
      sign_in users(:bob)
      agent = agents(:bob_website_agent)
+     agent.options['url_from_event'] = '{{ url }}'
+     agent.save!
      url_from_event = "http://xkcd.com/?from_event=1".freeze
      expect {
        post :dry_run, id: agent, event: { url: url_from_event }
spec/models/agents/website_agent_spec.rb

@@ -768,20 +768,13 @@ fire: hot
        @event = Event.new
        @event.agent = agents(:bob_rain_notifier_agent)
        @event.payload = {
-         'url' => 'http://xkcd.com',
-         'link' => 'Random',
+         'url' => 'http://foo.com',
+         'link' => 'Random'
        }
      end

-     it "should scrape from the url element in incoming event payload" do
-       expect {
-         @checker.options = @valid_options
-         @checker.receive([@event])
-       }.to change { Event.count }.by(1)
-     end
-
-     it "should use url_from_event as url to scrape if it exists when receiving an event" do
-       stub = stub_request(:any, 'http://example.org/?url=http%3A%2F%2Fxkcd.com')
+     it "should use url_from_event as the url to scrape" do
+       stub = stub_request(:any, 'http://example.org/?url=http%3A%2F%2Ffoo.com')

        @checker.options = @valid_options.merge(
          'url_from_event' => 'http://example.org/?url={{url | uri_escape}}'
@@ -791,9 +784,16 @@ fire: hot
        expect(stub).to have_been_requested
      end

+     it "should use the Agent's `url` option if url_from_event is not set" do
+       expect {
+         @checker.options = @valid_options
+         @checker.receive([@event])
+       }.to change { Event.count }.by(1)
+     end
+
      it "should allow url_from_event to be an array of urls" do
-       stub1 = stub_request(:any, 'http://example.org/?url=http%3A%2F%2Fxkcd.com')
-       stub2 = stub_request(:any, 'http://google.org/?url=http%3A%2F%2Fxkcd.com')
+       stub1 = stub_request(:any, 'http://example.org/?url=http%3A%2F%2Ffoo.com')
+       stub2 = stub_request(:any, 'http://google.org/?url=http%3A%2F%2Ffoo.com')

        @checker.options = @valid_options.merge(
          'url_from_event' => ['http://example.org/?url={{url | uri_escape}}', 'http://google.org/?url={{url | uri_escape}}']
@@ -805,7 +805,10 @@ fire: hot
      end

      it "should interpolate values from incoming event payload" do
+       stub_request(:any, /foo/).to_return(body: File.read(Rails.root.join("spec/data_fixtures/xkcd.html")), status: 200)
+
        expect {
+         @valid_options['url_from_event'] = '{{ url }}'
          @valid_options['extract'] = {
            'from' => {
              'xpath' => '*[1]',
@@ -821,11 +824,21 @@ fire: hot
        }.to change { Event.count }.by(1)

        expect(Event.last.payload).to eq({
-         'from' => 'http://xkcd.com',
+         'from' => 'http://foo.com',
          'to' => 'http://dynamic.xkcd.com/random/comic/',
        })
      end

+     it "should use the options url if no url is in the event payload, and `url_from_event` is not provided" do
+       @checker.options['mode'] = 'merge'
+       @event.payload.delete('url')
+       expect {
+         @checker.receive([@event])
+       }.to change { Event.count }.by(1)
+       expect(Event.last.payload['title']).to eq('Evolving')
+       expect(Event.last.payload['link']).to eq('Random')
+     end
+
      it "should interpolate values from incoming event payload and _response_" do
        @event.payload['title'] = 'XKCD'

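Together these specs pin down the new fallback: with neither `url_from_event` nor a usable Event `url`, the agent fetches its own `url` option, and `merge` mode layers extracted values over the incoming payload. A hypothetical illustration (the options hash is assumed for clarity, not copied from the fixtures):

    # Sketch of the fallback-plus-merge behavior the spec above asserts:
    # the agent's own `url` is fetched, and the Event's payload is kept.
    agent.options = {
      'url'     => 'http://xkcd.com',   # fetched as the fallback
      'type'    => 'html',
      'mode'    => 'merge',
      'extract' => {
        'title' => { 'css' => '#comic img', 'value' => '@alt' }  # assumed selector
      }
    }
    # incoming Event payload: { 'link' => 'Random' }
    # emitted Event payload:  { 'link' => 'Random', 'title' => 'Evolving' }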
@@ -1065,7 +1078,6 @@ fire: hot
          event = @events[6]
          expect(event.payload['url']).to eq("https://www.google.ca/search?q=%EC%9C%84%ED%82%A4%EB%B0%B1%EA%B3%BC:%EB%8C%80%EB%AC%B8")
        end
-
      end
    end
  end