INTERNET AND TECHNOLOGY REVIEW: Apache Nutch 2.1


Developer: Apache Software Foundation
Website: nutch.apache.org
License / Price: Apache License
Platforms: Windows / Linux / Mac OS / BSD / Solaris
Databases: N/A
Language: Java
Last Updated: October 9th, 2012, 15:57 GMT
Category: Search Engines

Apache Nutch builds on Lucene Java, adding web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats.
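To make the "link-graph database" idea concrete, here is a minimal sketch in Java of what such a structure records: for each page, the set of pages that link to it. The class name `LinkGraph` and its methods are hypothetical and purely illustrative. This is a toy in-memory map, not Nutch's actual storage layer or API.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy in-memory link graph; illustrative only, not Nutch's implementation. */
class LinkGraph {
    // Maps a target URL to the set of pages that link to it (its inlinks).
    private final Map<String, Set<String>> inlinks = new TreeMap<>();

    /** Records that the page at 'from' contains a link to 'to'. */
    void addLink(String from, String to) {
        inlinks.computeIfAbsent(to, k -> new TreeSet<>()).add(from);
    }

    /** Returns the pages known to link to 'url', or an empty set if none. */
    Set<String> inlinksOf(String url) {
        Set<String> sources = inlinks.getOrDefault(url, Collections.<String>emptySet());
        return Collections.unmodifiableSet(sources);
    }

    public static void main(String[] args) {
        LinkGraph graph = new LinkGraph();
        // In a real crawl these pairs would come from links found by the parsers.
        graph.addLink("http://a.example/", "http://c.example/");
        graph.addLink("http://b.example/", "http://c.example/");
        System.out.println(graph.inlinksOf("http://c.example/"));
    }
}
```

A crawler would typically feed such a structure with the outlinks its parsers extract, and a search engine can then use inlink information when scoring pages.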

Nutch can run on a single machine, but works better in Hadoop clusters.

Plugins are available to extend its functionality.

What's New in This Release:

· Renamed HTMLParseFilter to ParseFilter.
· Remove remaining robots/IP blocking code in lib-http.
· Port logging to slf4j.
· External parser supports encoding attribute.
· Ivy configuration settings don't include Gora.
· Injector should add the metadata before calling injectedScore.
· Port Nutch benchmark to Nutchbase.
· Add parse-html back.
· MoreIndexingFilter missing date format.
· Timeout for Parser.
· Retry interval in crawl date is set to 0.
· Generate log output for Solr indexer and dedup.
· Improved NutchConfiguration.
· SolrDeleteDuplicates needs to clone the SolrRecord objects.
· Native Hadoop libs not available through Maven.
· Separate the build and runtime environments.

Via: Apache Nutch 2.1


09 October, 2012
