Replacing Http-Connector with Jsoup usage

Hey everyone

A use case came up recently for doing some web scraping with Camunda. We needed more powerful HTTP request features, so we loaded Jsoup (http://jsoup.org) into Camunda in the shared engine.

From this we realized that this was actually a much more flexible solution compared to using HTTP-connector.

Jsoup is optimized for html/xml responses, but with a few tweaks to the configurations of a request (all standard/supported by Jsoup), we end up with a great JSON or “whatever” http request manager for requests and responses.

It also addresses other problems with http-connector, such as timeouts, attachments/binary data, buffer streams, and response time (many of which have been discussed in detail throughout the forum).

So what did we end up with?

First we added Jsoup to Camunda. We put together a docker compose for you to make this as easy as possible:

Specifically: https://github.com/DigitalState/camunda-variations/tree/master/web-scrape

This will load Camunda 7.7 Tomcat with deployment-aware = false and add Jsoup to the classpath.


Next:

In our case we use JavaScript/Nashorn for our scripting.

So you can easily access Jsoup through Java or just through the script engine:

Javascript:

with (new JavaImporter(org.jsoup))
{

  // var body = {
  //   "myKey1": "myValue1",
  //   "myKey2": "myValue2",
  //   "myKey3": {
  //     "internal1":"internalV1",
  //     "internal2":"internalV2"
  //     },
  //   "myKey4": [
  //     1,2,3,4,5
  //     ]
  // }

  var doc = Jsoup.connect('http://date.jsontest.com')
                  .method(Java.type('org.jsoup.Connection.Method').GET)
                  // .method(Java.type('org.jsoup.Connection.Method').POST)
                  .header('Accept', 'application/json')
                  .header('Content-Type', 'application/json')
                  // .data('filterABC', 'subgroup1')
                  // .requestBody(JSON.stringify(body))
                  .timeout(30000)
                  .ignoreContentType(true) // Needed because Jsoup only accepts its "approved" (HTML/XML-related) content types by default
                  .execute()

  var resBody = doc.body()
  var resStatusCode = doc.statusCode()
  var resStatusMessage = doc.statusMessage()
  var resContentType = doc.contentType()
  var resCharSet = doc.charset()

}

function spinify(body)
{
  var parsed = JSON.parse(body)
  var stringified = JSON.stringify(parsed)
  var spin = S(stringified)
  return spin
}

execution.setVariable('responseBodyString', spinify(resBody))

If you take a look through the Jsoup JavaDocs, there are tons of options to customize the exact request and response you wish to work with. You can also easily replicate HTTP-connector's Input/Output abilities with some JS functions that look for Input parameters on the task.
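As a rough sketch of the input-parameter idea, the helper below reads a request setting from the execution and falls back to a default when the variable is missing. The helper name `getInput` and the variable names `requestUrl` and `requestTimeout` are made up for illustration; only `hasVariable()` and `getVariable()` come from the examples in this thread. The stub `execution` object is there only so the snippet runs outside the engine.

```javascript
// Hypothetical helper: read an input parameter from the execution,
// falling back to a default when the variable is not set.
function getInput(execution, name, defaultValue) {
  if (execution.hasVariable(name)) {
    return execution.getVariable(name);
  }
  return defaultValue;
}

// Stand-in for the engine-provided `execution` object so the sketch runs
// on its own; inside a script task you would pass the real `execution`.
var stubExecution = {
  vars: { requestUrl: 'http://date.jsontest.com' },
  hasVariable: function (name) { return this.vars.hasOwnProperty(name); },
  getVariable: function (name) { return this.vars[name]; }
};

// 'requestUrl' is set on the stub, 'requestTimeout' is not, so the
// second call returns the default.
var requestUrl = getInput(stubExecution, 'requestUrl', null);
var requestTimeout = getInput(stubExecution, 'requestTimeout', 30000);
```

The two values could then be passed to `Jsoup.connect(requestUrl)` and `.timeout(requestTimeout)` instead of hard-coding them in the script.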

See: https://wiki.openjdk.java.net/display/Nashorn/Nashorn+extensions for further details about all the ways you can import packages and classes with Nashorn. In this case we are using the with statement together with JavaImporter, which lets us encapsulate the package in its own scope and not pollute the global scope.


The one item that would be great from @camunda is the ability to execute scripts as part of Service Tasks, Send Tasks, and Throw Message Events so that the BPMN does not have to be script tasks everywhere (or not having to use Execution events / input/output parameters).


Some quick examples of really useful methods from Jsoup:

Requests:


Responses:


A few notes on reasons for things:

  1. The spinify method parses the JSON response before stringifying it. Stringifying the raw body directly would leave \n characters in the result for pretty-printed JSON responses.
  2. ignoreContentType() is used on the .connect() chain because Jsoup validates the response content type before parsing and only accepts the content types it is optimized for (text, html, xhtml, xml, and other HTML-related content types).
  3. The .execute() method is used instead of get(), post(), etc. because execute() does not parse the response as HTML.
  4. Take a look at the many different ways to create headers and data. Maps, query params, form data, etc. are supported.
  5. Note the enum usage in the .method() line. The enum is a specific Java type, and because JS is loosely typed you need to reference that type explicitly with Java.type(). See the https://wiki.openjdk.java.net/display/Nashorn/Nashorn+extensions docs for more info.
  6. Another great benefit is that Jsoup throws much more specific errors/exceptions, which makes them easier to catch and helps with general debugging.
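To illustrate note 1, here is a small sketch of what happens to a pretty-printed body with and without the parse step. The sample body is invented for the example and is not an actual response from the test endpoint.

```javascript
// A pretty-printed JSON body (made-up sample data).
var rawBody = '{\n  "date": "01-01-2018",\n  "time": "10:00:00 AM"\n}';

// Stringifying the raw string escapes its line breaks instead of removing
// them, so the result contains literal backslash-n sequences.
var escaped = JSON.stringify(rawBody);

// Parsing first and then stringifying yields compact JSON with no line breaks.
var compact = JSON.stringify(JSON.parse(rawBody));
```

The `compact` value is the kind of string that spinify() hands to S().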

Would love to hear anyone's thoughts on this approach. So far our testing is showing very fast execution times, and other than a few patterns to develop (such as the spinify() method and the usage of ignoreContentType() in the code sample above), everything “just works” as one would expect with an HTTP request DSL.


For anyone interested in web scraping html:

You can also import HTML into a Spin XML object and then use Spin’s XPath capabilities to parse the HTML.

Something like this:

function getUrlAsXhtmlString(url)
{
  with (new JavaImporter(org.jsoup))
  {
    var doc = Jsoup.connect(url).get();
    doc.outputSettings().syntax(Java.type("org.jsoup.nodes.Document.OutputSettings.Syntax").xml);
    doc.outputSettings().charset('UTF-8');

    var docString = doc.html();

    return docString;
  }
}

function generateSpinVariables(xHtmlString)
{
  var htmlSpin = S(xHtmlString);
  return htmlSpin;
}

function scrape(url)
{
  var xHtmlString = getUrlAsXhtmlString(url);
  return generateSpinVariables(xHtmlString);
}

var xhtml = scrape('http://myurl');

var links = xhtml.xPath('//main//ul/li/a/@href').attributeList();


Here is another interesting use case: you could use Jsoup together with the withVariablesInReturn parameter of the process-definition /start endpoint to provide a validations workflow.

Example: you submit a form, and you need to have server side validations with other systems.
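For reference, a request body for the start call might be built roughly like this. It is only a sketch: it assumes the REST API's `withVariablesInReturn` option is available in your Camunda version (check the REST docs for the release you run), and the function name `buildStartPayload` and the sample value are made up for illustration.

```javascript
// Hypothetical helper: build the JSON body for
// POST /process-definition/key/{key}/start
function buildStartPayload(valueToValidate) {
  return {
    variables: {
      // The variable the validation script reads from the execution.
      valueToValidate: { value: valueToValidate, type: 'String' }
    },
    // Assumption: asks the engine to return the process variables
    // (including the validation response) in the start response.
    withVariablesInReturn: true
  };
}

var payload = buildStartPayload('subgroup1');
var payloadJson = JSON.stringify(payload);
```

The caller would send `payloadJson` with a Content-Type of application/json and read the validation result from the variables in the response.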

FaaS
Could also be 4 sequential scripts. The performance was the same.

So your setup would look something like:



The function.js looks like this:

if (execution.hasVariable('valueToValidate')){
  var valueToValidate = execution.getVariable('valueToValidate')
} else {
  throw 'valueToValidate variable does not exist'
}

with (new JavaImporter(org.jsoup))
{
  var doc = Jsoup.connect('http://ip.jsontest.com')
                  .method(Java.type('org.jsoup.Connection.Method').GET)
                  .header('Accept', 'application/json')
                  .data('filterABC', valueToValidate)
                  .timeout(5000)
                  .ignoreContentType(true)
                  .execute()

  var resBody = doc.body()
}

function spinify(body)
{
  var parsed = JSON.parse(body)
  var stringified = JSON.stringify(parsed)
  var spin = S(stringified)
  return spin
}

execution.setVariable('response', spinify(resBody))
// execution.setVariable('response', resBody)

The performance was about the same for returning a SPIN json object vs a string.


Stephen, thank you so much for sharing all these real examples and how-tos!

I want to add my two cents :) If you only need a simple API call from a script task, you can do that by calling the internal HTTP connector as shown below. In this case you do not have to add any library to the classpath.

In a JavaScript script task:

var httpConnector = org.camunda.connect.Connectors.http()
var resp = httpConnector.createRequest()
  .get()
  .url("http://<host>:[<port>]/artefacts/"+execution.getVariable('artefactId'))
  .execute()

var result = resp.getResponse()
resp.close()

execution.setVariable('responseBodyString', result)

Here we call the Java HTTP connector class from JS, so the scenarios from the official docs (POST, adding headers, and so on) apply as well:

https://docs.camunda.org/manual/7.8/reference/connect/http-connector/#create-a-simple-http-request

regards,
dmitry


@dmitrysd great example! It also fills in a usage example from a while back about using HTTP Connector directly through scripting.

Two issues come to mind to be aware of when using this solution:

  1. You will still have the timeout issue discussed above and elsewhere on the forum: HTTP Connector does not have a built-in timeout and thus can create jobs that last “forever”, which can result in Job Executor blocking (the executor is stuck executing jobs that will never end).

  2. HTTP Connector cannot read binary data, so you will not be able to download binary files: they will be converted to Strings and data will be lost. Again, this is covered in other threads about HTTP Connector's handling of binary data, such as downloading PDFs.


Thanks for sharing. I am now using Camunda 7.9. Is there a guide for how to install Camunda with Jsoup?

thanks.

@liang you must add it to the classpath, similar to: https://github.com/DigitalState/camunda-variations/tree/master/web-scrape. See the Dockerfile. Then you can access it like any other class.


Hi, @StephenOTT. Thank you for this lib, it works.
But how can I handle a timeout (exception) from a Jsoup request so that it is not an error for Camunda? In my case there is no need to retry on timeout.

@raliev you use Jsoup's timeout feature/method and you wrap your Jsoup request in a try/catch.

Oh, of course! Thanks!! Can I do it with a Nashorn JS script too?

Yes. There is a specific exception from Jsoup that is related to a timeout.
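A minimal sketch of that try/catch pattern, written so it runs on its own. The wrapper name `executeWithTimeoutHandling` and the fake request are made up for illustration. As far as I know the Jsoup docs list `java.net.SocketTimeoutException` for a timed-out `execute()`, but treat the Nashorn check shown in the comment as an assumption to verify in your setup.

```javascript
// Hypothetical wrapper: run a request function and report a timeout as a
// normal result instead of letting the exception fail the script task.
function executeWithTimeoutHandling(requestFn, isTimeout) {
  try {
    return { timedOut: false, response: requestFn() };
  } catch (e) {
    if (isTimeout(e)) {
      return { timedOut: true, response: null };
    }
    throw e; // anything else should still surface as an error
  }
}

// Stand-ins so the sketch runs outside the engine. In Nashorn, requestFn
// would wrap the Jsoup chain ending in .execute(), and isTimeout could be:
//   function (e) { return e instanceof Java.type('java.net.SocketTimeoutException'); }
function fakeTimedOutRequest() {
  var err = new Error('Read timed out');
  err.name = 'SocketTimeoutException';
  throw err;
}

function fakeIsTimeout(e) {
  return e.name === 'SocketTimeoutException';
}

var result = executeWithTimeoutHandling(fakeTimedOutRequest, fakeIsTimeout);
```

In a script task you could then store `result.timedOut` in a process variable (the variable name is up to you) and route on it with a gateway, so a timeout does not create an incident or trigger retries.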
