Nutch 2.1: building and debugging in Eclipse with MySQL storage on Windows
2022-06-29 20:03:00 【Brother Xing plays with the clouds】
step 1: Prepare Eclipse, the Eclipse SVN plug-in, and MySQL. Configure MySQL to use UTF-8 encoding.
step 2: Create the database and the table in MySQL:
CREATE DATABASE nutch;
USE nutch;
CREATE TABLE `webpage` (
  `id` varchar(767) NOT NULL,
  `headers` blob,
  `text` mediumtext DEFAULT NULL,
  `status` int(11) DEFAULT NULL,
  `markers` blob,
  `parseStatus` blob,
  `modifiedTime` bigint(20) DEFAULT NULL,
  `score` float DEFAULT NULL,
  `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `baseUrl` varchar(767) DEFAULT NULL,
  `content` longblob,
  `title` varchar(2048) DEFAULT NULL,
  `reprUrl` varchar(767) DEFAULT NULL,
  `fetchInterval` int(11) DEFAULT NULL,
  `prevFetchTime` bigint(20) DEFAULT NULL,
  `inlinks` mediumblob,
  `prevSignature` blob,
  `outlinks` mediumblob,
  `fetchTime` bigint(20) DEFAULT NULL,
  `retriesSinceFetch` int(11) DEFAULT NULL,
  `protocolStatus` blob,
  `signature` blob,
  `metadata` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED DEFAULT CHARSET=utf8mb4;
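The `id` primary key in the table above is declared as varchar(767) under utf8mb4. Whether MySQL accepts a key that long depends on the server's index key-size limit. The Python sketch below only does the arithmetic, assuming a 767-byte limit per index key; that limit is an assumption about the server, not something this article states, so check your own MySQL version and settings.

```python
# Worst-case index size arithmetic for a VARCHAR primary key.
# ASSUMPTION: the server enforces a 767-byte limit per index key.
# The limit on your server may be different.

# Maximum bytes a single character can occupy in each MySQL character set.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}


def max_indexable_chars(charset, key_limit_bytes=767):
    """Longest VARCHAR length whose worst-case byte size fits the key limit."""
    return key_limit_bytes // MAX_BYTES_PER_CHAR[charset]


for charset in MAX_BYTES_PER_CHAR:
    print(charset, max_indexable_chars(charset))
```

Under that assumption a utf8mb4 key column cannot be anywhere near 767 characters long, which is consistent with needing a much shorter `id`.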
On my machine `id` varchar(767) NOT NULL was rejected. The largest length that worked was 100, so I used `id` varchar(100) NOT NULL instead.
step 3: Check out the code from https://svn.apache.org/repos/asf/nutch/tags/release-2.1 and create a local Java project from it. I had tried this many times, so my project is simply named test.
step 4: Add the source folders. In Project Explorer, right-click the project and choose Properties. Go to Java Build Path and, on the Source tab, delete the src folder. Then choose "Add Folder" and add conf, src/bin, src/java, src/test, src/testresources, plus the src/java and src/test folders of every plug-in under src/plugin. The Source tab then lists all of these folders under the project (test is the project name).
Every Eclipse project folder contains a .classpath file. Open it and you should see content roughly like this:
<classpathentry kind="src" path="conf"/>
<classpathentry kind="src" path="src/java"/>
<classpathentry kind="src" path="src/test"/>
<classpathentry kind="src" path="src/plugin/protocol-file/src/test"/>
<classpathentry kind="src" path="src/plugin/protocol-httpclient/src/test"/>
<classpathentry kind="src" path="src/plugin/subcollection/src/test"/>
<classpathentry kind="src" path="src/plugin/parse-html/src/test"/>
<classpathentry kind="src" path="src/plugin/urlfilter-automaton/src/test"/>
<classpathentry kind="src" path="src/plugin/parse-html/src/java"/>
<classpathentry kind="src" path="src/plugin/parse-tika/src/test"/>
<classpathentry kind="src" path="src/plugin/lib-http/src/test"/>
<classpathentry kind="src" path="src/plugin/parse-tika/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-regex/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-domain/src/java"/>
<classpathentry kind="src" path="src/plugin/scoring-link/src/java"/>
<classpathentry kind="src" path="src/plugin/index-anchor/src/test"/>
<classpathentry kind="src" path="src/plugin/protocol-http/src/java"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-regex/src/test"/>
<classpathentry kind="src" path="src/plugin/urlfilter-prefix/src/java"/>
<classpathentry kind="src" path="src/plugin/scoring-opic/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-domain/src/test"/>
<classpathentry kind="src" path="src/plugin/protocol-file/src/java"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-regex/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-suffix/src/java"/>
<classpathentry kind="src" path="src/plugin/language-identifier/src/java"/>
<classpathentry kind="src" path="src/plugin/lib-regex-filter/src/test"/>
<classpathentry kind="src" path="src/plugin/language-identifier/src/test"/>
<classpathentry kind="src" path="src/plugin/subcollection/src/java"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-basic/src/test"/>
<classpathentry kind="src" path="src/plugin/index-basic/src/java"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-pass/src/test"/>
<classpathentry kind="src" path="src/plugin/creativecommons/src/java"/>
<classpathentry kind="src" path="src/bin"/>
<classpathentry kind="src" path="src/plugin/protocol-httpclient/src/java"/>
<classpathentry kind="src" path="src/plugin/tld/src/java"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-basic/src/java"/>
<classpathentry kind="src" path="src/plugin/index-basic/src/test"/>
<classpathentry kind="src" path="src/plugin/lib-http/src/java"/>
<classpathentry kind="src" path="src/plugin/protocol-ftp/src/java"/>
<classpathentry kind="src" path="src/plugin/index-anchor/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-validator/src/java"/>
<classpathentry kind="src" path="src/plugin/index-more/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-suffix/src/test"/>
<classpathentry kind="src" path="src/plugin/creativecommons/src/test"/>
<classpathentry kind="src" path="src/plugin/microformats-reltag/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-regex/src/test"/>
<classpathentry kind="src" path="src/plugin/lib-regex-filter/src/java"/>
<classpathentry kind="src" path="src/plugin/index-more/src/test"/>
<classpathentry kind="src" path="src/plugin/urlnormalizer-pass/src/java"/>
<classpathentry kind="src" path="src/plugin/urlfilter-automaton/src/java"/>
<classpathentry kind="src" path="src/testresources"/>
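Adding every plug-in folder by hand is tedious, so the source entries can also be generated. The following Python sketch walks a checkout and prints one `classpathentry` line per existing source folder. The function name `classpath_entries` and the idea of pasting its output into .classpath are my own illustration, not part of Nutch or Eclipse; it assumes the folder layout described in step 4.

```python
from pathlib import Path
from typing import List


def classpath_entries(project_root: str) -> List[str]:
    """Build <classpathentry kind="src"> lines for folders that actually exist."""
    root = Path(project_root)
    # Top-level source folders named in step 4.
    fixed = ["conf", "src/bin", "src/java", "src/test", "src/testresources"]
    paths = [p for p in fixed if (root / p).is_dir()]

    # Each plug-in may have src/java and/or src/test.
    plugin_root = root / "src" / "plugin"
    if plugin_root.is_dir():
        for plugin in sorted(plugin_root.iterdir()):
            for sub in ("java", "test"):
                candidate = plugin / "src" / sub
                if candidate.is_dir():
                    paths.append(candidate.relative_to(root).as_posix())

    return ['<classpathentry kind="src" path="%s"/>' % p for p in paths]
```

Calling `print("\n".join(classpath_entries(".")))` from the project folder prints the lines. Back up the existing .classpath before editing it, and refresh the project in Eclipse afterwards.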
step 5: Add the library JARs. Switch to the Libraries tab and choose "Add Library" -> "IvyDE Managed Dependencies" -> "Next". Select "Project", pick the ivy\ivy.xml file, and click OK. Eclipse then downloads the dependent JARs automatically.
This step may report errors. In my case the error message showed that the org.restlet.jse packages could not be downloaded. The fix: in ivy\ivy.xml, find the following two entries and comment them out:
<dependency org="org.restlet.jse" name="org.restlet" rev="2.0.5" conf="*->default" />
<dependency org="org.restlet.jse" name="org.restlet.ext.jackson" rev="2.0.5" conf="*->default" />
Then download these two packages manually, put them in the lib folder, and add them on the Libraries tab.
Next, add the ivy.xml file of each plug-in under the src/plugin folder in the same way, one at a time.
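The restlet workaround from step 5 can be applied with a small text edit instead of by hand. This Python sketch wraps every `<dependency>` element whose `org` is `org.restlet.jse` in an XML comment. The function name is hypothetical, and the sketch assumes each dependency is a single self-closing element that is not already commented out.

```python
import re

# Matches one self-closing <dependency .../> element for org.restlet.jse.
# [^<] is used instead of [^>] because conf="*->default" contains a '>'.
_RESTLET_DEP = re.compile(
    r'<dependency\b[^<]*?\borg="org\.restlet\.jse"[^<]*?/>'
)


def comment_out_restlet(ivy_xml_text):
    """Return the ivy.xml text with the restlet dependencies commented out."""
    return _RESTLET_DEP.sub(lambda m: "<!-- " + m.group(0) + " -->", ivy_xml_text)
```

Read ivy\ivy.xml as text, pass it through `comment_out_restlet`, and write the result back. Running it twice would nest comments, so keep a copy of the original file.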
step 6: On the "Order and Export" tab, move conf to the top.
step 7: Database and other configuration. Open /conf/gora.properties, delete everything in the file, and write the MySQL configuration:
###############################
# MySQL properties #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=123456
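Since the whole file is replaced, a quick sanity check can catch a missing or empty key before the first run. The sketch below is a deliberately simplified reader for `key=value` lines, not a full Java properties parser, and the helper names are made up for illustration. The required keys are the four shown above.

```python
# The four keys written to gora.properties in step 7.
REQUIRED_KEYS = (
    "gora.sqlstore.jdbc.driver",
    "gora.sqlstore.jdbc.url",
    "gora.sqlstore.jdbc.user",
    "gora.sqlstore.jdbc.password",
)


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        # Split on the first '=' only: the JDBC URL itself contains '='.
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_keys(props):
    """Return the required keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not props.get(k)]
```

An empty list from `missing_keys(parse_properties(text))` means all four settings are present. It does not verify that MySQL accepts them.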
In /conf/gora-sql-mapping.xml, modify the primarykey element:
<primarykey column="id" length="240"/>
In /conf/nutch-site.xml, add the following properties:
<property>
  <name>http.agent.name</name>
  <value>Your Nutch Spider</value>
</property>
<property>
  <name>http.accept.language</name>
  <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
  <description>Value of the "Accept-Language" request header field. This allows selecting non-English language as default one to retrieve. It is a useful setting for search engines build for certain national group. </description>
</property>
<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other information is available</description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library. </description>
</property>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
  <description>The Gora DataStore class for storing and retrieving data. Currently the following stores are available: …. </description>
</property>
<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located. Each element may be a relative or absolute path. If absolute, it is used as is. If relative, it is searched for on the classpath.</description>
</property>
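Because `plugin.includes` is a regular expression over plug-in directory names, it is easy to enable or disable a plug-in by accident. The Python sketch below reads a property from nutch-site.xml text and lists which directory names the expression selects. It assumes the whole name must match the expression, which is my reading of the property description and not something verified against the Nutch source. Both function names are illustrative.

```python
import re
import xml.etree.ElementTree as ET


def read_property(site_xml_text, name):
    """Return the <value> of the named <property>, or None if absent."""
    root = ET.fromstring(site_xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


def included_plugins(includes_regex, plugin_dirs):
    """Directory names matched in full by the plugin.includes expression."""
    pattern = re.compile(includes_regex)
    return [d for d in plugin_dirs if pattern.fullmatch(d)]
```

Passing the value configured above together with the directory names found under src/plugin shows, for example, whether `index-more` or `urlfilter-domain` is left out by the expression.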
In build.xml in the project root, find the following target:
<target name="resolve-default" depends="clean-lib, init" description="--> resolve and retrieve dependencies with ivy">
  <ivy:resolve file="${ivy.file}" conf="default" log="download-only" />
  <ivy:retrieve pattern="${build.lib.dir}/[artifact]-[revision].[ext]" symlink="false" log="quiet" />
  <antcall target="copy-libs" />
</target>
Replace pattern="${build.lib.dir}/[artifact]-[revision].[ext]" with pattern="${build.lib.dir}/[artifact]-[type]-[revision].[ext]".
step 8: Configure the URLs to crawl. In the test project, create a folder named urls, and inside it a file named seeds.txt containing the sites you want to crawl. I used http://www.163.com.
step 9: Run org.apache.nutch.crawl.Crawler. Open the Crawler class, choose "Run As" -> "Run Configurations", and on the "Arguments" tab enter "urls -depth 3 -topN 5" under "Program Arguments". Click "Run". The first run fails with an error similar to "Failed to set permissions of path: \tmp\Hadoop-Administrator\mapred\staging\Administrator1712398257\.". This is a Hadoop problem. The fix is to edit /hadoop-1.0.2/src/core/org/apache/hadoop/fs/FileUtil.java and comment out the body of checkReturnValue. The easiest route, of course, is to find an already patched package online and replace FileUtil.class with it. Run the crawler again, and this time it completes successfully.
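The seed folder from step 8 and the argument string from step 9 can also be prepared from a script, which helps when the project is recreated several times. This is a minimal Python sketch; the function names and defaults are my own and simply mirror the values used above.

```python
from pathlib import Path


def write_seed_file(project_root, urls, folder="urls", filename="seeds.txt"):
    """Create <project_root>/urls/seeds.txt with one URL per line."""
    seed_dir = Path(project_root) / folder
    seed_dir.mkdir(parents=True, exist_ok=True)
    seed_file = seed_dir / filename
    seed_file.write_text("\n".join(urls) + "\n", encoding="utf-8")
    return seed_file


def crawler_arguments(folder="urls", depth=3, top_n=5):
    """Program arguments for the Eclipse run configuration in step 9."""
    return "%s -depth %d -topN %d" % (folder, depth, top_n)
```

For the setup in this article, `write_seed_file(".", ["http://www.163.com"])` run from the project folder creates the seed file, and `crawler_arguments()` gives the string to paste into "Program Arguments".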
Good luck to you .
Problems encountered:
1. The run reports: Exception in thread "main" java.lang.RuntimeException: job failed: name=parse, jobid=job_local_0004
Searching online turns up many possible causes. First, check that nutch-default.xml configures <name>plugin.folders</name><value>./src/plugin</value>. Second, look in the hadoop.log file for the underlying error.
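When hadoop.log is long, filtering it for the lines that usually carry the cause saves time. The Python sketch below returns numbered lines containing common error markers. The marker strings are an assumption about typical Java log output, not a documented Nutch log format, and the function name is illustrative.

```python
def error_lines(log_text, markers=("ERROR", "Exception", "Caused by")):
    """Return (line_number, line) pairs for lines containing any marker."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), 1):
        if any(marker in line for marker in markers):
            hits.append((number, line))
    return hits
```

Read the log file into a string, pass it to `error_lines`, and start reading at the first hit. The stack trace lines that follow it in the file usually name the failing class.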