Job Recruitment Website - Zhaopin.com - Python Crawler

Python crawler

1. Basic principle of a web crawler

A traditional crawler starts from the URLs of one or more seed pages and extracts the URLs found on those pages. As it crawls, it keeps pulling new URLs out of the current page and putting them into a queue, until a certain stop condition of the system is met.

The workflow of a focused crawler is more complex: it must use some web page analysis algorithm to filter out links irrelevant to the topic, keep the useful links, and put them into the URL queue to be crawled. It then selects the URL of the next page to crawl from the queue according to a certain search strategy, and repeats the process above until a stop condition of the system is reached.
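To make this concrete, here is a minimal Python 3 sketch of the queue-driven loop described above. The seed URL, the page limit, and the `is_relevant` filter are illustrative assumptions rather than anything from the original text, and links are extracted with a simple regular expression instead of a full HTML parser:

```python
import re
import urllib.request
from collections import deque

def is_relevant(url):
    # Placeholder page-analysis step: a focused crawler would apply a real
    # relevance algorithm here instead of a simple substring test.
    return "zhaopin.com" in url

def crawl(seed_url, max_pages=50):
    queue = deque([seed_url])   # URL queue seeded with the initial page
    seen = {seed_url}           # remember visited URLs to avoid loops
    while queue and max_pages > 0:   # stop condition: empty queue or page limit
        url = queue.popleft()        # search strategy: FIFO, i.e. breadth-first
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except OSError:
            continue                 # skip pages that cannot be fetched
        max_pages -= 1
        # Extract new URLs from the current page and enqueue the useful ones.
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if is_relevant(link) and link not in seen:
                seen.add(link)
                queue.append(link)

if __name__ == "__main__":
    crawl("https://www.zhaopin.com/")
```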

2. Basic design concept

First, go to the Weibo login page and simulate logging in. Then grab a page, extract all the URLs from it, and select the URLs whose text descriptions meet your requirements. Simulate clicking on those URLs, and repeat the crawling steps above until the requirements are met, then exit.
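A rough sketch of these steps follows, using the third-party requests and BeautifulSoup libraries. The login URL, form field names, and keyword filter below are hypothetical placeholders for illustration; the real Weibo login flow involves encrypted parameters and redirects and changes over time:

```python
import requests
from bs4 import BeautifulSoup

session = requests.Session()  # a Session keeps cookies across requests, like a browser

# Step 1: simulate login (URL and form fields are hypothetical placeholders).
session.post("https://example.com/login",
             data={"username": "me", "password": "secret"})

# Step 2: grab a page and collect every link on it.
resp = session.get("https://example.com/start")
soup = BeautifulSoup(resp.text, "html.parser")

# Step 3: keep only URLs whose visible text descriptions meet the requirements.
keyword = "weibo"  # illustrative filter on the link's anchor text
targets = [a["href"] for a in soup.find_all("a", href=True)
           if keyword in a.get_text().lower()]

# Step 4: "simulate clicking" the selected URLs, then repeat the steps above.
for url in targets:
    page = session.get(url)
    # ... parse page.text for further links and content here ...
```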

3. Existing projects

There is a project on Google Code called sinawler, a dedicated Sina Weibo crawler for grabbing Weibo content. You may not be able to reach that site directly, you know. However, you can search Baidu for "Sina Weibo crawler written in Python (updated for the current Weibo login method)" and find reference source code, which is written in Python 2. If you write it in Python 3, you can use urllib.request to build a cookie-aware opener that simulates a browser, which spares you manual cookie handling and makes the code shorter.
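For reference, here is a minimal Python 3 sketch of that urllib.request approach: building an opener with an HTTPCookieProcessor gives you a browser-like session that stores and resends cookies automatically. The login URL and form fields are illustrative assumptions, not the actual Weibo login interface:

```python
import http.cookiejar
import urllib.parse
import urllib.request

# A CookieJar attached to the opener stores cookies from responses and
# resends them on later requests, so no manual cookie handling is needed.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar)
)

# POST the login form (URL and fields are placeholders); any cookies set
# in the response are kept in the jar.
login_data = urllib.parse.urlencode(
    {"username": "me", "password": "secret"}
).encode("utf-8")
opener.open("https://example.com/login", data=login_data)

# Subsequent requests through the same opener send the stored cookies,
# so the session behaves like a logged-in browser.
html = opener.open("https://example.com/protected-page").read()
```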

4. In addition

Also take a look at the Baidu Encyclopedia entry on web crawlers. It covers many topics in depth, such as page analysis algorithms and crawling strategies, which will be a great help and will improve the code from a more theoretical standpoint.