Data collection and management [6]
2022-06-29 03:44:00 【Star drawing question bank】
1. The collection targets of a () are pages that cannot be reached through static links and are hidden behind search forms; they can only be obtained after a user submits certain keywords.
A. Focused crawler
B. Incremental web crawler
C. General-purpose web crawler
D. Deep web crawler
2. The HTTP () method appends new data to the resource identified by the Request-URI.
A.GET
B.POST
C.PUT
D.TRACE
3. In an HTTP request, data sent with the () method is placed after the URL, separated from the URL by a ?, with parameters joined by &.
A.GET
B.POST
C.PUT
D.TRACE
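For illustration, a minimal sketch of how such parameters end up in the URL, using the third-party requests library (httpbin.org is only a placeholder test endpoint, not part of the question):

```python
import requests

# requests encodes the params dict into the query string, so the final URL
# becomes https://httpbin.org/get?keyword=crawler&page=2
# ("?" separates the URL from the data, "&" joins the parameters).
resp = requests.get("https://httpbin.org/get",
                    params={"keyword": "crawler", "page": 2})
print(resp.url)  # shows the URL with the appended query string
```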
4. In regular expression syntax, () matches any non-whitespace character.
A.\S
B.\d
C.\W
D.\w
5. In regular expressions, () matches any character except a newline.
A.^
B.\d
C..
D.\w
6. Which of the following statements about quantifiers in regular expressions is incorrect? ()
A. X+ means X can appear 0, 1, or more times
B. X means X must appear exactly once
C. X* means X can appear 0, 1, or more times
D. X+ means X can appear 1 or more times
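A small sketch with Python's standard re module showing the behaviour of these quantifiers (the patterns and sample strings are invented for illustration):

```python
import re

# "+" requires at least one occurrence; "*" allows zero or more.
print(re.fullmatch(r"ab+c", "ac"))     # None: "+" needs at least one b
print(re.fullmatch(r"ab*c", "ac"))     # match: "*" allows zero b's
print(re.fullmatch(r"ab+c", "abbbc"))  # match: "+" allows many b's
```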
7. Causes of data quality problems do not include ().
A. Time differences in the data
B. Diversity of data acquisition methods
C. Data instability
D. Data dependency
8. In an HTTP response, the () header fields specify the time the message was sent and the document expiration time, respectively.
A.Date,Expires
B.Date,Allow
C.Last-Modified,Allow
D.Last-Modified,Expires
9. The form-filling approach of deep web crawlers based on web page structure analysis generally represents the web form as a () and extracts the value of each form field.
A. DOM tree
B. BOM tree
C. Image
D. Text
10. Data transformation does not include ().
A. Conversion of inconsistent data
B. Data granularity transformation
C. Calculation of business rules
D. Incomplete data
11. Network data collection generally obtains data from websites through () or public website APIs.
A. Web crawler
B. Website log
C.HTTP
D. Forms
12. Data () refers to whether the data is easy to obtain, easy to understand, and easy to use.
A. Completeness
B. Su Shi
C. Guan Hanqing
D. Li Qingzhao
13. In the crawling process of a (), the most important part is form filling and processing.
A. Focused crawler
B. Incremental web crawler
C. General-purpose web crawler
D. Deep web crawler
14. Under the (), pages are crawled according to the depth of their directory level: pages at shallower levels are crawled first, and once all pages at the same level have been crawled, the crawler moves one level deeper and continues crawling. (A sketch of this idea follows the options.)
A. Depth-first strategy
B. Breadth-first strategy
C. PageRank-priority strategy
D. Random crawling strategy
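The level-by-level idea can be sketched roughly as follows; this is only an assumption-laden toy example, where fetch_links is a hypothetical helper that downloads a page and returns its hyperlinks:

```python
from collections import deque

def bfs_crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl sketch: shallow pages are visited before deeper ones.

    fetch_links(url) is assumed to download the page and return the
    hyperlinks it contains; it is a placeholder, not a real library call.
    """
    queue = deque(seed_urls)       # FIFO queue drives the level-by-level order
    seen = set(seed_urls)
    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()      # take the shallowest unvisited page first
        crawled.append(url)
        for link in fetch_links(url):
            if link not in seen:   # avoid revisiting pages
                seen.add(link)
                queue.append(link)
    return crawled
```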
15. In data preprocessing, the L in ETL stands for ().
A. Extract
B. Transform
C. Load
D. Cleaning
16. The HTTP () method can be used to request or query the capabilities of the server, or to query options and requirements related to a resource.
A.OPTIONS
B.DELETE
C.PUT
D.TRACE
17. A single HTTP transaction consists of ().
A. One request
B. One response
C. One request and one response
D. Two requests
18. Regular expressions support boundary matching. For example, () matches the beginning of a line.
A.^
B.\d
C.\w
D.$
19. The two main goals of a () are to keep the pages stored in the local page set up to date and to improve the quality of the pages in that set.
A. Focused crawler
B. Incremental web crawler
C. General-purpose web crawler
D. Deep web crawler
20. The () search strategy searches in order of depth from shallow to deep, following links level by level into deeper pages until it can go no further; it is better suited to vertical search or site-specific search.
A. Breadth-first
B. Depth-first
C. Based on target characteristics
D. Based on domain
21. The task of data () is to filter out data that does not meet requirements.
A. Extract
B. Transform
C. Load
D. Cleaning
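As a hedged sketch of the "filter out records that do not meet requirements" idea (field names and rules are invented for illustration):

```python
# Toy cleaning step: drop records with missing fields or out-of-range values.
raw_records = [
    {"id": 1, "age": 25},
    {"id": 2, "age": None},   # missing value -> rejected
    {"id": 3, "age": -7},     # out of range  -> rejected
]

def is_valid(rec):
    return rec["age"] is not None and 0 <= rec["age"] <= 120

cleaned = [r for r in raw_records if is_valid(r)]
print(cleaned)  # [{'id': 1, 'age': 25}]
```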
22. Regarding surface web pages and deep web pages, which of the following statements is not true? ()
A. Surface web pages are pages that can be indexed by traditional search engines; they mainly consist of static web pages reachable through hyperlinks.
B. Deep web pages are pages whose content largely cannot be reached through static links and is hidden behind search forms; they can only be obtained after a user submits certain keywords.
C. Deep web pages contain far less information than surface web pages.
D. Deep web crawlers are mainly used to crawl the deep web pages behind search forms.
23. The server that answers an issued HTTP request stores resources such as HTML files and images. This answering server is called the ().
A. Browser
B. Player
C. User agent
D. Origin server
24. In data preprocessing, the E in ETL stands for ().
A. Extract
B. Transform
C. Load
D. Cleaning
25. The HTTP request header field () contains information about the agent making the request, such as the name and version of the client software.
A.Host
B.User-Agent
C.Cookie
D.Referer
26. () describes a pattern for string matching and is usually used to find or replace text that matches the pattern (rule).
A. Web crawler
B. Data collection
C. Character set
D. Regular expressions
27. Regarding the difference between GET and POST, which of the following statements is incorrect? ()
A. Data submitted with GET is placed after the URL
B. Data submitted with POST is placed after the URL
C. With the GET method, Request.QueryString is used to get a variable's value
D. With the POST method, Request.Form is used to get a variable's value
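A small client-side sketch of the GET/POST difference using the requests library (Request.QueryString and Request.Form in options C and D are classic ASP identifiers; httpbin.org below is only a placeholder echo endpoint):

```python
import requests

# GET: data travels in the URL query string.
r1 = requests.get("https://httpbin.org/get", params={"q": "etl"})
print(r1.url)              # ...?q=etl -> visible in the URL

# POST: data travels in the request body, not the URL.
r2 = requests.post("https://httpbin.org/post", data={"q": "etl"})
print(r2.url)              # no query string appended
print(r2.json()["form"])   # {'q': 'etl'} echoed back from the body
```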
28. Resources requested over the HTTP or HTTPS protocol are identified by a ().
A.HTML
B.URL
C.TCP
D.FTP
29. In an HTTP response message, a status code of 404 means ().
A. The requested page was not found
B. Login failed
C. The requested page has been moved to a new URL
D. Access is forbidden
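A minimal sketch of checking the status code in a fetching script (the URL is a placeholder):

```python
import requests

resp = requests.get("https://example.com/maybe-missing")
if resp.status_code == 404:
    print("page not found")            # the requested page does not exist
elif resp.status_code == 200:
    print("request succeeded")
else:
    print("other status:", resp.status_code)
```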
30. Which of the following is an HTTP request header field? ()
A.User-Agent
B.Content-Length
C.Accept-Ranges
D.Expires
31. Which of the following is not an HTTP request method? ()
A.GET
B.POST
C.TRACE
D.SUBMIT
32. Data cleaning is a one-time process. (1 point)
33. The timeliness dimension of data quality refers to whether the data is within the acceptable range defined by the enterprise. (1 point)
34. A disadvantage of the depth-first strategy is that it takes a long time to crawl pages at deep directory levels. (1 point)
35. The HTTP TRACE method asks the server to send back the request information it received, and is mainly used for testing or diagnosis. (1 point)
36. The HTTP response header Content-Type indicates the natural language of the response entity. (1 point)
37. The HTTP PUT method asks the server to store a resource and to use the Request-URI as its identifier. (1 point)
38. Surface web pages are pages whose content largely cannot be reached through static links and is hidden behind search forms. (1 point)
39. A web crawler always starts from a certain starting point; this starting point is called a seed. (1 point)
40. The "hungry" (lazy) mode in regular expressions matches as much text as possible. (1 point)
41. Web crawler technology does not support collecting files or attachments such as images, audio, and video. (1 point)
42. The regular expression quantifier + means 0 or more occurrences. (1 point)
43. A focused crawler filters out links irrelevant to its topic according to a web page analysis algorithm, keeps the useful links, and places them in the queue of URLs waiting to be crawled. (1 point)
44. In regular expressions, \d represents any word character. (1 point)
45. Surface web pages are pages that can be indexed by traditional search engines. (1 point)
46. The widespread use of web crawlers may lead to the disclosure of personal privacy. (1 point)
47. HTTP status code 500 indicates that the request failed because of the client. (1 point)
48. The "hungry" (lazy) mode in regular expressions matches as little text as possible. (1 point)
49. Web crawlers are widely used by Internet search engines and similar websites to obtain or update the content and retrieval methods of these sites. (1 point)
50. General-purpose web crawlers have high requirements for crawling speed and storage space. (1 point)
51. The Range header field in an HTTP request can request one or more sub-ranges of an entity. (1 point)
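A hedged sketch of requesting a sub-range of an entity with the Range header (whether the server honours it depends on the server; the URL is a placeholder):

```python
import requests

# Ask only for the first 1024 bytes of the resource.
resp = requests.get("https://example.com/big-file.bin",
                    headers={"Range": "bytes=0-1023"})
# 206 Partial Content would mean the server honoured the sub-range request.
print(resp.status_code, len(resp.content))
```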
52. The regular expression [abc] represents the string abc. (1 point)
53. The Keep-Alive feature in an HTTP request makes the connection between the client and the server persistent. (1 point)
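As a sketch of what persistent connections look like in client code: requests.Session keeps a connection pool, so later requests can reuse the same underlying TCP connection, which is what Keep-Alive enables (URLs are placeholders):

```python
import requests

with requests.Session() as session:      # connection pool stays open
    for path in ("/a", "/b", "/c"):
        resp = session.get("https://example.com" + path)
        print(path, resp.status_code)    # later requests reuse the connection
```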
54. Regular expressions cannot match special characters. (1 point)
55. In regular expressions, quantifiers can match multiple occurrences of an expression. (1 point)
56. A web crawler crawls along web pages and their hyperlinks; each time it reaches a page it fetches it with a scraping program, extracts the content, and at the same time extracts the hyperlinks as clues for further crawling. (1 point)
57. The breadth-first strategy cannot avoid the problem of crawling never ending when an infinitely deep branch is encountered. (1 point)
58. Languages such as Java and Python also support regular expressions. (1 point)
59. Surface web pages mainly consist of static web pages that can be reached through hyperlinks. (1 point)
60. For systems with large amounts of data, a one-time data extraction is also commonly performed. (1 point)
61. The Cookie header in an HTTP request indicates the client type. (1 point)
62. Regular expressions are often used to find or replace text that matches a given pattern (rule). (1 point)
63. Data transformation mainly involves conversion of inconsistent data, transformation of data granularity, and calculation of some business rules. (1 point)
64. The HTTP DELETE method asks the server to delete the resource identified by the Request-URI. (1 point)
65. The surface web (Surface Web) is the largest and fastest-growing new type of information resource on the Internet. (1 point)
66. At present, mainstream web development languages do not support regular expressions. (1 point)
67. The regular expression [abc] matches the character a, b, or c. (1 point)
68. A web crawler can automatically collect the content of all pages it is able to access. (1 point)
69. GET requests the resource identified by the Request-URI. (1 point)
70. The HTTP TRACE method queries the capabilities of the server, or queries options and requirements related to a resource. (1 point)
71. Regular expressions consist of ordinary characters and metacharacters. (1 point)
72. The correctness (accuracy) dimension of data quality refers to whether the data correctly represents a real or verifiable source. (1 point)
73. In data transformation, null values can be loaded as-is or replaced with other meaningful data, and records can be routed to different target databases based on a field's null value. (1 point)
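A small, assumption-laden sketch of the null-value handling described in statement 73, using pandas (the column names and replacement values are invented):

```python
import pandas as pd

df = pd.DataFrame({"customer": ["a", "b", None], "amount": [10.0, None, 3.5]})

# Option 1: replace nulls with other meaningful data before loading.
filled = df.fillna({"customer": "unknown", "amount": 0.0})

# Option 2: route records to different targets depending on null fields.
complete_rows = df[df["amount"].notna()]   # load into the main target table
problem_rows = df[df["amount"].isna()]     # load into a quarantine table instead
print(filled, complete_rows, problem_rows, sep="\n\n")
```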
74. The HTTP PUT method asks the server to delete the resource identified by the Request-URI. (1 point)
75. The HTTP response header Content-Type indicates the media type and character set of the response entity. (1 point)
76. Keep-Alive in HTTP requests has no negative impact on the server. (1 point)
77. HTTP status code 200 indicates that the request was successful. (1 point)
78. HTTP status code 500 indicates that the request failed because of the server. (1 point)
79. General-purpose web crawlers usually work in parallel, but it still takes a long time to refresh pages. (1 point)
80. Greedy matching in regular expressions matches as little text as possible. (1 point)
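A sketch of the greedy/lazy contrast that statements 40, 48, and 80 refer to, using Python's re module (the sample string is invented):

```python
import re

html = "<b>bold</b> and <i>italic</i>"
print(re.findall(r"<.+>", html))   # greedy: ['<b>bold</b> and <i>italic</i>']
print(re.findall(r"<.+?>", html))  # lazy:   ['<b>', '</b>', '<i>', '</i>']
```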
81. HTTP request header fields may include Authorization, Referer, Content-Type, Content-Encoding, and others. (1 point)
82. With the rapid development of the network, continuously optimized web crawler technology is effectively addressing various challenges and provides strong support for efficient search in the specific fields and topics users care about. (1 point)
83. The Referer header field in an HTTP request contains information about the requesting user. (1 point)
84. A focused web crawler has relatively low requirements on the order in which pages are crawled. (1 point)
85. The breadth-first strategy crawls pages according to the depth of their directory level, crawling pages at shallower levels first. (1 point)
86. Business systems generally store very detailed data, so business system data is usually aggregated to the granularity of the data warehouse. (1 point)
87. In regular expressions, square brackets [] denote a character class. (1 point)
88. The most important part of Deep Web crawler crawling is link extraction. (1 point)
89. The depth-first strategy is better suited to vertical search or site-specific search. (1 point)
90. The amount of information stored in deep web pages accounts for only a small part of the information on the Internet. (1 point)
91. Small websites receive no crawler visits. (1 point)
92. Data quality manifests itself in correctness, integrity, consistency, completeness, validity, timeliness, availability, and so on. (1 point)
93. Crawler tools can only be written in the Java language. (1 point)
94. Regular expressions support boundary matching, for example word boundaries and the beginning or end of text. (1 point)
95. The HTTP POST method asks the server to store a resource and to use the Request-URI as its identifier. (1 point)
96. HTTP is a stateless protocol. (1 point)
97. An incremental web crawler crawls only newly generated or updated pages, and only when needed. (1 point)
98. Besides collecting information, web crawlers can even be used to implant malware, damage web content, and even hijack websites and servers. (1 point)
99. The status code is an important piece of information in an HTTP response. (1 point)
100. After a web crawler has crawled various resources, the information is organized with appropriate indexing techniques and made available for users to query. (1 point)