
One trick teaches you to use Hive to process text data.

After studying big data for several months, I finally got an assignment from my boss! With the core technology in hand, even my step feels lighter. The requirement itself is actually simple and clear.

My boss wants me to research the core skills required by different positions in the recruitment market. We already have some data on hand: recruitment-related records, one field of which is called the job description. As the name implies, anyone who has looked for a job knows what a job description is, especially those who haven't studied big data yet; they may have browsed countless job postings without landing an offer. Ha, I found a job right after studying big data.

The job description is essentially a passage of text stating what abilities or skills an applicant needs to be competent in the role. With this data, my preliminary research plan is as follows:

Analyze this field, extract all the keywords, then group and count them by the position each record belongs to. That gives the number of occurrences of each keyword for each position, and the keywords that appear most often are naturally the core skill keywords of that position. The plan is perfect.

Now all I can think about is completing the task perfectly, winning the boss's appreciation, getting a promotion and a raise, and marrying the partner of my dreams. But as the saying goes, everything is ready except the east wind: the crucial missing piece is how to turn a pile of text (job descriptions) into individual words, what we usually call word segmentation. Today I'll introduce how to complete this task. Time for the real content.

First of all, we use Hive for data processing and analysis. Checking the Hive documentation, we find that Hive's built-in functions cannot do word segmentation, but Hive provides UDFs (user-defined functions) so that we can implement the extra functionality ourselves. The process of developing a UDF (in Java) roughly goes like this:

1. Create a Maven project for writing the UDF and import the related big data dependencies; the most important ones are hive-exec and hadoop-common.
2. Create a class that extends the UDF class and implement the evaluate() method, putting the conversion logic inside it.
3. Package the Maven project and upload the jar to HDFS.
4. In Hive, add a function associated with the UDF class; you can then call that function to get the desired behavior.

Note that the input of the evaluate() method is one piece of data and the output is also one piece of data: a value comes into Hive, gets converted, and a converted value comes back, just like the familiar lower()/upper(). Hive also supports other kinds of user-defined function classes, such as UDAF and UDTF: a UDAF takes multiple rows of input and returns one row, like the aggregate functions sum()/count(), while a UDTF takes one row of input and returns multiple rows, like the table-generating function explode(). Interested students can look these up on their own.
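To make the skeleton concrete, here is a minimal sketch of such a UDF. The class name and the toy lower-casing logic are just illustrative; the point is extending the UDF base class and defining evaluate():

```java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Minimal one-in/one-out UDF skeleton, in the same spirit as lower()/upper().
public class MyLowerUDF extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;   // pass NULLs through explicitly
        }
        return new Text(input.toString().toLowerCase());
    }
}
```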

Let's start writing the UDF for word segmentation. First, of course, we import the related dependencies. There are many libraries for word segmentation; I choose the IK segmenter commonly used in Java, whose dependency name is ikanalyzer.

After that, we can define the related stop words and preferred words. Since we ultimately want meaningful keywords, we should remove useless words such as "my", "post", "very good", and so on, because we don't want them mixed into the keywords we get in the end. Preferred words are also necessary: a segmenter relies on a corpus and an algorithm, and it sometimes splits words incorrectly. For example, "machine learning" might be split into the two words "machine" and "learning", but for a job posting this is clearly a single skill term, so we register this proper noun as a preferred word so the segmenter will not split it next time.

Put the stop words in stopword.dic and the preferred words in extword.dic, one word per line, then configure the paths of these two files in IKAnalyzer.cfg.xml. The IK segmenter will then automatically load our custom stop words and preferred words.
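For reference, in the classic IKAnalyzer 2012 distribution this configuration file looks roughly like the sketch below; the entry key names depend on the IK version you use, so treat this as an assumption to verify against your own jar:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- preferred (extension) words, one per line in the file -->
    <entry key="ext_dict">extword.dic</entry>
    <!-- stop words, one per line in the file -->
    <entry key="ext_stopwords">stopword.dic</entry>
</properties>
```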

Next, write the main UDF class.

The UDF's job is to segment the incoming string and then join the resulting words with the special character "\1", finally returning the joined string.
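Here is a sketch of what that UDF could look like. The class name IkSegmentUDF is made up for illustration, and it assumes the IKSegmenter API from the ikanalyzer dependency together with the old-style Hive UDF base class:

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

// Segments the incoming text with IK and joins the words with the '\1' character.
public class IkSegmentUDF extends UDF {
    public Text evaluate(Text input) throws IOException {
        if (input == null) {
            return null;
        }
        // true = "smart" segmentation; stop words and preferred words
        // are picked up automatically from IKAnalyzer.cfg.xml
        IKSegmenter segmenter = new IKSegmenter(new StringReader(input.toString()), true);
        StringBuilder joined = new StringBuilder();
        Lexeme lexeme;
        while ((lexeme = segmenter.next()) != null) {
            if (joined.length() > 0) {
                joined.append('\1');
            }
            joined.append(lexeme.getLexemeText());
        }
        return new Text(joined.toString());
    }
}
```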

Following the UDF development steps above, package the UDF into a jar, upload it to HDFS, and create a function in Hive associated with the jar.
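On the Hive side, that association step might look like the statements below; the jar path, the function name ik_seg, and the fully qualified class name are placeholders for whatever you actually used:

```sql
-- Register the jar uploaded to HDFS and bind a function name to the UDF class
-- (paths and names here are hypothetical).
ADD JAR hdfs:///user/hive/udf/ik-segment-udf.jar;
CREATE TEMPORARY FUNCTION ik_seg AS 'com.example.IkSegmentUDF';

-- In newer Hive versions, a permanent function can reference the HDFS jar directly:
-- CREATE FUNCTION ik_seg AS 'com.example.IkSegmentUDF'
--   USING JAR 'hdfs:///user/hive/udf/ik-segment-udf.jar';
```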

In short, we use HiveSQL to complete the rest of the work; I won't walk through the SQL in detail here. In the end, we get the data we want from the raw data.
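As a rough illustration of the kind of HiveSQL involved, the grouping and counting could look like this, assuming a hypothetical table jobs with columns job_tag and job_description and the ik_seg function defined above:

```sql
-- Segment each description, split on the '\1' delimiter, and count words per position.
SELECT
    job_tag,
    word AS sub,
    COUNT(*) AS cnt
FROM jobs
LATERAL VIEW explode(split(ik_seg(job_description), '\\001')) w AS word
GROUP BY job_tag, word
ORDER BY cnt DESC;
```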

According to our data processing and analysis results, the algorithm position (job_tag) shows considerable demand for the keywords (sub) "algorithm", "c++", and "machine learning": in the descriptions of algorithm positions, the word "algorithm" appears 6366 times, "c++" appears 376 times, and "machine learning" also ranks among the top keywords.

That's all for my sharing today. Have you learned the trick?