It builds on Lucene Java, adding web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats.
Nutch can run on a single machine, but works better on a Hadoop cluster.
Plugins are available to extend its functionality.
What's New in This Release:
· Renamed HTMLParseFilter to ParseFilter.
· Remove remaining robots/IP blocking code in lib-http.
· Port logging to slf4j.
· External parser supports encoding attribute.
· Ivy configuration settings don't include Gora.
· Injector should add the metadata before calling injectedScore.
· Port Nutch benchmark to Nutchbase.
· Add parse-html back.
· MoreIndexingFilter missing date format.
· Timeout for Parser.
· Retry interval in crawl date is set to 0.
· Generate log output for Solr indexer and dedup.
· Improved NutchConfiguration.
· SolrDeleteDuplicates needs to clone the SolrRecord objects.
· Native Hadoop libs not available through Maven.
· Separate the build and runtime environments.

Via: Apache Nutch 2.1