1. The page we see in a browser is the result of rendering that happens after the HTML is downloaded. Content produced by JS or AJAX does not exist until the browser has run that code, so when we fetch the page directly over a socket we get only the raw source, and the content we actually need is not in it.
2. JS functions, and the page elements that call them, depend on events driven by the user. A page fetched over a socket is just a stream of bytes to us: there is no way to simulate user events on it, so content that is only displayed after an event fires cannot be crawled either.
On Windows the WebBrowser control solves both problems very easily. On Linux, for now, the only approach we could think of is to rely on Gecko, the core of Firefox. Since we use Java as our development language, we use the JRex JAR package, which wraps the calls to Gecko's native library (DLL), so that Gecko can be used directly from Java.
The core is JRexCanvas: obtain a Document object through it, cast it, and then use the createRange method of its DocumentRange to get the content of the page after rendering. (Those familiar with WebBrowser will find that many of the types and methods are very similar, but JRex has almost no documentation, so you can only fumble around on your own.)
Document doc = navigation.getDocument();
DocumentRange range = ((org.mozilla.jrex.dom.JRexDocumentImpl) doc).getDocumentRange();
System.out.println(xmlToString(range.createRange().getCommonAncestorContainer()));
Unlike WebBrowser, there is no createTextRange() method for turning the HTML directly into plain text, so we have to process the nodes ourselves.
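Helper function used to output a Node:
import java.io.StringWriter;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Node;

// Serialize a DOM node (and everything below it) to an HTML string.
public static String xmlToString(Node node) throws Exception {
    Source source = new DOMSource(node);
    StringWriter stringWriter = new StringWriter();
    Result result = new StreamResult(stringWriter);
    TransformerFactory factory = TransformerFactory.newInstance();
    Transformer transformer = factory.newTransformer();
    // Use the "html" output method so the markup is written as HTML rather than XML.
    transformer.setOutputProperty(OutputKeys.METHOD, "html");
    transformer.transform(source, result);
    return stringWriter.getBuffer().toString();
}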
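Since there is no ready-made method for getting plain text, the text has to be collected from the nodes by hand. Below is a minimal sketch of one way to do that, not something taken from JRex itself: it walks any org.w3c.dom.Node recursively and joins the text nodes it finds, skipping script and style elements. The class name NodeText and the methods collectText and nodeToText are made up for illustration, and the sample document is parsed with the standard JDK parser so the example runs without Gecko. It assumes the node returned by getCommonAncestorContainer() behaves like an ordinary org.w3c.dom.Node, which is what passing it to DOMSource above already relies on.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NodeText {
    // Recursively append the text of every text node under 'node' to 'out',
    // skipping the contents of <script> and <style> elements.
    public static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' '); // separate fragments with a single space
                }
                out.append(text);
            }
            return;
        }
        String name = node.getNodeName();
        if ("script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name)) {
            return; // code and CSS are not visible page text
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
    }

    // Convenience wrapper (hypothetical name) returning the collected text.
    public static String nodeToText(Node node) {
        StringBuilder out = new StringBuilder();
        collectText(node, out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a rendered page; with JRex the node would come from
        // range.createRange().getCommonAncestorContainer() instead.
        String xhtml = "<html><body><h1>Title</h1><script>var x = 1;</script>"
                + "<p>Hello <b>world</b></p></body></html>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        System.out.println(nodeToText(doc.getDocumentElement())); // prints "Title Hello world"
    }
}
```

In the crawler, the same call would be nodeToText(range.createRange().getCommonAncestorContainer()) in place of xmlToString when only the visible text is wanted.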







