Data collection and management [4]
2022-06-29 03:44:00 [Star drawing question bank]
1. In incremental crawlers, () refers to: the crawler visits all web pages with the same frequency, regardless of how often the pages change.
A. Uniform update method
B. Classification-based update method
C. Random update method
D. Individual update method
2. () targets pages that cannot be reached through static links and are hidden behind search forms, obtainable only after a user submits keywords.
A. Focused crawler
B. Deep web crawler
C. General-purpose web crawler
D. Incremental web crawler
3. In an HTTP request, the () field may contain information such as "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit…".
A. Cookie
B. Host
C. User-Agent
D. Connection
4. Which of the following statements about crawling strategies is incorrect? ()
A. The breadth-first strategy can effectively control the crawling depth of pages and avoids getting stuck when an infinitely deep branch is encountered.
B. The depth-first strategy is better suited to vertical search or site-specific search, but crawling sites with deep content hierarchies wastes enormous resources.
C. A drawback of the depth-first strategy is that it takes a long time to reach pages deep in the directory hierarchy.
D. Common crawling strategies for general-purpose web crawlers are the depth-first strategy and the breadth-first strategy.
5. In the data preprocessing acronym ETL, the L stands for ().
A. Cleaning
B. Extraction
C. Loading
D. Transformation
6. The basic approach of () is to follow web page links level by level, in order of increasing depth, until no further progress can be made.
A. PageRank-priority strategy
B. Random crawling strategy
C. Breadth-first strategy
D. Depth-first strategy
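The breadth-first and depth-first strategies in questions 4 and 6 differ only in how the frontier of unvisited links is managed. A minimal sketch, with a made-up in-memory link graph standing in for a real site (the `crawl` function and the `site` dict are illustrative, not from any crawler library):

```python
from collections import deque

def crawl(link_graph, seed, strategy="bfs"):
    """Traverse a site's link graph from a seed URL.

    strategy="bfs": frontier is a FIFO queue -> shallow pages crawled first.
    strategy="dfs": frontier is a LIFO stack -> each branch is followed to
    its deepest page before backtracking.
    """
    frontier = deque([seed])
    visited = []
    seen = {seed}
    while frontier:
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        visited.append(url)
        for link in link_graph.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Hypothetical site: /a and /b at depth 1, /a/x at depth 2.
site = {"/": ["/a", "/b"], "/a": ["/a/x"], "/b": []}
print(crawl(site, "/", "bfs"))  # visits both depth-1 pages before /a/x
print(crawl(site, "/", "dfs"))  # dives to /a/x before finishing depth 1... per branch
```

The single `strategy` switch is the whole difference: the same loop becomes breadth-first or depth-first depending on which end of the deque is popped.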
7. To collect specified data you need a (): because it only crawls pages related to its topic, it greatly saves hardware and network resources, and since the saved pages are few, they are refreshed quickly.
A. Incremental web crawler
B. Deep web crawler
C. General-purpose web crawler
D. Focused crawler
8. The HTTP () method can be used to query the server's capabilities, or to query options and requirements related to a resource.
A. TRACE
B. OPTIONS
C. PUT
D. DELETE
9. Resources requested over HTTP or HTTPS are identified by ().
A. TCP
B. FTP
C. URL
D. HTML
10. The () search strategy follows web page links level by level, in order of increasing depth, until no further progress can be made; it is better suited to vertical search or site-specific search.
A. Depth-first
B. Based on target characteristics
C. Breadth-first
D. Based on domain
11. Which of the following statements about web crawlers is incorrect? ()
A. A web crawler is essentially a program that "browses the web automatically", a kind of internet robot.
B. Web crawlers are widely used by internet search engines and other similar websites.
C. At present, most information classification on the internet is done manually.
D. A traditional crawler starts from the URLs of one or more initial web pages; while fetching pages, it continually extracts new URLs into a queue until certain stop conditions of the system are met.
12. A single HTTP transaction consists of ().
A. One request
B. Two requests
C. One request and one response
D. One response
13. Which of the following is a deep web page? ()
A. A page whose content is visible only after the user registers
B. A home page
C. A static page reachable through hyperlinks
D. A site navigation page
14. In an HTTP request, the () method sends data by appending it to the URL: a ? separates the URL from the data, and parameters are joined by &.
A. POST
B. GET
C. TRACE
D. PUT
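The ?-and-& layout described in question 14 can be built and taken apart with the standard library; the host and parameter names below are made up for illustration:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# GET parameters ride in the URL itself: "?" separates the path from the
# query string, "&" separates the key=value pairs.
params = {"q": "web crawler", "page": "2"}
url = "https://example.com/search?" + urlencode(params)
print(url)  # https://example.com/search?q=web+crawler&page=2

# The receiving side can split the same URL back into its parameters.
query = parse_qs(urlsplit(url).query)
print(query)  # {'q': ['web crawler'], 'page': ['2']}
```

Note that `urlencode` also percent-escapes characters (here the space becomes `+`), which is why query data should never be pasted into a URL by hand.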
15. Pursuing high data quality is an important requirement of big data. Eliminating the unpredictability of some data and removing "noisy", "dirty" data involves () technology.
A. Data forecasting
B. Data collection
C. Data cleaning
D. Data statistics
16. Which of the following does NOT match the regular expression "^[\d]+$"? ()
A. 123
B. 10
C. 12
D. 12abc
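Question 16 can be checked directly: the anchors `^` and `$` force the whole string to be digits, so a trailing letter breaks the match. A quick sketch:

```python
import re

pattern = re.compile(r"^[\d]+$")  # one or more digits, anchored at both ends

# Only all-digit strings match; "12abc" fails because of the trailing letters.
for s in ["123", "10", "12", "12abc"]:
    print(s, bool(pattern.match(s)))
```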
17. Data transformation does not include ().
A. Conversion of inconsistent data
B. Incomplete data
C. Calculation of business rules
D. Transformation of data granularity
18. In regular expression syntax, () matches any non-whitespace character.
A. \d
B. \S
C. \w
D. \W
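The four classes in question 18 are easy to confuse; a small demonstration of how they partition single characters (the sample characters are arbitrary):

```python
import re

# \d = digit, \w = word character (letter, digit, underscore),
# \S = any NON-whitespace character, \W = any NON-word character.
assert re.fullmatch(r"\S", "x")          # a letter is not whitespace
assert re.fullmatch(r"\S", "#")          # punctuation is not whitespace either
assert re.fullmatch(r"\S", " ") is None  # a space IS whitespace
assert re.fullmatch(r"\w", "#") is None  # '#' is not a word character
assert re.fullmatch(r"\W", "#")          # ...so \W accepts it
```

So `\S` is the broadest of the four: it rejects only spaces, tabs, and newlines.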
19. An HTTP response status code of 403 means ().
A. The server connection timed out
B. The request succeeded
C. The server is busy
D. Access to the requested page is forbidden
20. In an HTTP response, () gives the length of the entity body, expressed as a decimal number of bytes.
A. Content-Range
B. Content-Length
C. Content-Encoding
D. Content-Language
21. The PageRank-priority strategy is typically used by ().
A. All web crawlers
B. Deep web crawlers
C. Incremental web crawlers
D. General-purpose web crawlers
22. Which of the following does NOT match the regular expression "^[\w]+$"? ()
A. S1
B. S+1
C. S_1
D. 12
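Question 22 hinges on what `\w` covers: letters, digits, and the underscore, but not `+`. Checking the four options:

```python
import re

word_only = re.compile(r"^[\w]+$")  # letters, digits, underscore only

candidates = ["S1", "S+1", "S_1", "12"]
matches = [s for s in candidates if word_only.match(s)]
print(matches)  # "S+1" drops out: "+" is not in \w
```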
23. The () of data quality indicates whether the data correctly represents a real or verifiable source.
A. Consistency
B. Completeness
C. Integrity
D. Correctness
24. Which of the following is an HTTP request header? ()
A. User-Agent
B. Content-Length
C. Expires
D. Accept-Ranges
25. The HTTP request header () keeps the client's connection to the server alive, so that subsequent requests to the server avoid establishing or re-establishing a connection.
A. Host
B. Referer
C. Cookie
D. Keep-Alive
26. A deep web crawler that fills in forms based on web page structure analysis generally represents the form as a () and extracts the value of each form field from it.
A. Image
B. BOM tree
C. DOM tree
D. Text
27. The task of data () is to filter out data that does not meet requirements.
A. Transformation
B. Extraction
C. Cleaning
D. Loading
28. Which of the following is NOT a standardization or integrity requirement of data quality? ()
A. Information from legacy systems is consistent with other modules
B. Referential integrity is not violated: data never fails to find what it references
C. The data is internally consistent
D. There are no cross-system matching violations; the data is well integrated
29. A client that initiates an HTTP request to a specified port on a server, using a web browser, a web crawler, or another tool, is called ().
A. A user agent
B. An origin server
C. A player
D. A database
30. In an HTTP request, the () header field contains information about the requesting user.
A. User-Agent
B. Referer
C. Cookie
D. Authorization
31. In an HTTP response message, status code 200 means ().
A. The requested page has moved to a new URL
B. Login failed
C. The request succeeded
D. Access is forbidden
32. If an HTTP request returns 404, which of the following measures should be taken? ()
A. Check again whether the requested page address is correct
B. Report the fault to the network administrator
C. Check browser permissions
D. Ask the administrator for a user name and password
33. The regular expression [a-z] matches ().
A. Any lowercase character in the range "a" to "z"
B. The alphabetic character "a" or "z"
C. Any alphabetic character in the range "a" to "z"
D. The lowercase character "a" or "z"
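The character class in question 33 matches exactly one character from the range, which rules out options B, C, and D. A quick check:

```python
import re

lower = re.compile(r"[a-z]")  # character class: the RANGE a through z

assert lower.fullmatch("m")             # any lowercase letter in the range
assert lower.fullmatch("a") and lower.fullmatch("z")
assert lower.fullmatch("M") is None     # uppercase lies outside the range
assert lower.fullmatch("ab") is None    # the class matches ONE character only
```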
34. Which of the following is NOT an HTTP request method? ()
A. TRACE
B. SUBMIT
C. POST
D. GET
35. The HTTP POST method asks the server to store a resource, using the Request-URI as its identifier.
A. True
B. False
36. In an HTTP request, the Range header field contains information about the requesting user.
A. True
B. False
37. The basic approach of the depth-first strategy is to follow web page links level by level, in order of increasing depth.
A. True
B. False
38. Network data collection cannot process unstructured data.
A. True
B. False
39. HTTP status code 500 indicates that the request failed because of the client.
A. True
B. False
40. Focused web crawlers place relatively low requirements on the order in which pages are crawled.
A. True
B. False
41. A web server keeps no information about the browser process that sent a request.
A. True
B. False
42. For systems with large amounts of data, one-time full extraction is also generally used.
A. True
B. False
43. In an HTTP request, Cookie indicates the client type.
A. True
B. False
44. Focused crawlers are also called topic crawlers.
A. True
B. False
45. The volume of accessible information in surface pages (Surface Web) is […] that of deep pages (Deep Web).
A. True
B. False
46. In an HTTP request, the Referer header field contains information about the requesting user.
A. True
B. False
47. Lazy (non-greedy) mode in regular expressions matches as much text as possible.
A. True
B. False
48. Data quality manifests concretely as correctness, integrity, consistency, completeness, validity, timeliness, availability, and so on.
A. True
B. False
49. A web crawler is essentially a program that "browses the web automatically".
A. True
B. False
50. The HTTP TRACE method queries the server's capabilities, or queries options and requirements related to a resource.
A. True
B. False
51. The depth-first strategy is better suited to vertical search or site-specific search, but crawling sites with deep content hierarchies wastes enormous resources.
A. True
B. False
52. Regular expressions consist of ordinary characters and metacharacters.
A. True
B. False
53. The HTTP PUT method asks the server to send back the request it received, and is mainly used for testing or diagnosis.
A. True
B. False
54. The information stored in deep web pages accounts for only a small fraction of the information on the internet.
A. True
B. False
55. A focused crawler only needs to crawl pages related to its topic.
A. True
B. False
56. The size of data submitted via POST is limited, to at most 1024 bytes.
A. True
B. False
57. HTTP request header fields may include Authorization, Referer, Content-Type, Content-Encoding, and others.
A. True
B. False
58. In an HTTP request, the Range header field can request one or more sub-ranges of an entity.
A. True
B. False
59. The breadth-first strategy crawls pages according to the depth of the content directory; pages at shallower directory levels are crawled first.
A. True
B. False
60. Data cleaning is a one-time process.
A. True
B. False
61. A web crawler always starts from some starting point, which is called a seed.
A. True
B. False
62. General-purpose web crawlers place high demands on crawling speed and storage space.
A. True
B. False
63. Regular expressions support matching boundaries, for example word boundaries and the beginning or end of a text.
A. True
B. False
64. In HTTP, the status code is an important piece of information in the response.
A. True
B. False
65. In regular expressions, $ matches the beginning of a line.
A. True
B. False
66. Crawler tools can only be written in the Java language.
A. True
B. False
67. The HTTP OPTIONS method queries the server's capabilities, or queries options and requirements related to a resource.
A. True
B. False
68. A practical web crawler system is usually a combination of several crawling techniques.
A. True
B. False
69. HTTP status code 400 indicates that the request succeeded.
A. True
B. False
70. Some information can be learned from an HTTP request, for example: the requesting client, the language of the request, keep-alive status (keep-alive), and so on.
A. True
B. False
71. The HTTP DELETE method requests that the server delete the resource identified by the Request-URI.
A. True
B. False
72. The Java language does not support regular expressions.
A. True
B. False
73. Greedy mode in regular expressions matches as much text as possible.
A. True
B. False
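The greedy/lazy distinction behind questions 47, 73, and 92 shows up clearly when matching HTML tags; the sample string is made up:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: ".*" grabs as much as possible, overshooting to the LAST ">".
greedy = re.search(r"<.*>", html).group()
# Lazy (reluctant, non-greedy): ".*?" stops at the FIRST ">".
lazy = re.search(r"<.*?>", html).group()

print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>
```

Both patterns are valid; greedy quantifiers match as much text as possible, while appending `?` makes them match as little as possible.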
74. The accuracy (Accuracy) of data quality refers to whether the data correctly represents a real or verifiable source.
A. True
B. False
75. GET requests the resource identified by the Request-URI.
A. True
B. False
76. Common crawling strategies for focused web crawlers are the depth-first and breadth-first strategies.
A. True
B. False
77. Surface pages are those whose content mostly cannot be reached through static links and is hidden behind search forms.
A. True
B. False
78. Deep web pages contain far less information than surface pages and are not worth crawling.
A. True
B. False
79. Web crawlers are widely used by internet search engines and other similar websites to obtain or update those sites' content and indexes.
A. True
B. False
80. At present, most information classification on the internet is done manually.
A. True
B. False
81. The regular expression quantifier + means 0 or more occurrences.
A. True
B. False
82. In regular expressions, $ matches the end of a line.
A. True
B. False
83. Incremental web crawlers have a large data download volume and high time and space costs.
A. True
B. False
84. Hypertext Transfer Protocol usually works by an HTTP client initiating a request, establishing a TCP connection to a specified port on the server.
A. True
B. False
85. HTTP responses include Content-Encoding, Content-Length, Content-Type, and so on.
A. True
B. False
86. Deep web pages are far fewer in number than surface pages.
A. True
B. False
87. Compared with general-purpose web crawlers, focused crawlers add a link evaluation module and a content evaluation module.
A. True
B. False
88. DELETE requests that the server delete the resource identified by the Request-URI.
A. True
B. False
89. The timeliness of data quality refers to whether the data is within the acceptable range defined by the enterprise.
A. True
B. False
90. The depth-first strategy is better suited to vertical search or site-specific search.
A. True
B. False
91. Different enterprises have different business rules and different data indicators; these indicators can be completed simply by adding, subtracting, and combining.
A. True
B. False
92. Lazy (non-greedy) mode in regular expressions matches as little text as possible.
A. True
B. False
93. General-purpose web crawlers are suited to search engines that search broad topics, and have strong application value.
A. True
B. False
94. The regular expression [abc] represents the string "abc".
A. True
B. False
95. HTTP status code 500 indicates that the request failed because of the server.
A. True
B. False
96. General-purpose web crawlers usually work in parallel, but refreshing crawled pages takes a long time.
A. True
B. False
97. Common HTTP request methods include GET, HEAD, and POST.
A. True
B. False
98. HtmlParser is an HTML parsing library written in Java.
A. True
B. False
99. The regular expression quantifier ? means 0 or more occurrences.
A. True
B. False
100. An HTTP request consists of three parts: the request line, the message headers, and the request body.
A. True
B. False
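The three-part layout in question 100 is visible if a request is written out as raw text; the host and path below are made up:

```python
# A minimal GET request as raw text: request line, then header lines,
# then a blank line; GET normally carries no request body.
request = (
    "GET /index.html HTTP/1.1\r\n"  # request line: method, URI, version
    "Host: example.com\r\n"         # message headers follow, one per line
    "Connection: keep-alive\r\n"
    "\r\n"                          # blank line marks the end of the headers
)

# Pulling the request line back apart recovers its three fields.
request_line = request.split("\r\n")[0]
method, uri, version = request_line.split()
print(method, uri, version)  # GET /index.html HTTP/1.1
```

A POST request would add its form data after the blank line, which is exactly the body slot that GET leaves empty.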