intro.ipynb
11.7 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# simplifyweibo_4_moods 说明\n",
"0. **下载地址:** [百度网盘](https://pan.baidu.com/s/16c93E5x373nsGozyWevITg)\n",
"1. **数据概览:** 36 万多条,带情感标注 新浪微博,包含 4 种情感,其中喜悦约 20 万条,愤怒、厌恶、低落各约 5 万条\n",
"2. **推荐实验:** 情感/观点/评论 倾向性分析\n",
"2. **数据来源:** [新浪微博](https://weibo.com/)\n",
"3. **原数据集:** [微博情感分析数据集](https://download.csdn.net/download/turkan/9181661),网上搜集,具体作者、来源不详\n",
"4. **加工处理:**\n",
" 1. 将原来的 4 份文档,整合成 1 份 csv 文件\n",
" 2. 原始语料进行了分词处理,我们重新将其还原为未分词的语料\n",
" 3. 编码统一为 UTF-8\n",
" 4. 去重"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"path = 'simplifyweibo_4_moods_文件夹_所在_路径'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. simplifyweibo_4_moods.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 加载数据"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"微博数目(总体):361744\n",
"微博数目(喜悦):199496\n",
"微博数目(愤怒):51714\n",
"微博数目(厌恶):55267\n",
"微博数目(低落):55267\n"
]
}
],
"source": [
"pd_all = pd.read_csv(path + 'simplifyweibo_4_moods.csv')\n",
"moods = {0: '喜悦', 1: '愤怒', 2: '厌恶', 3: '低落'}\n",
"\n",
"print('微博数目(总体):%d' % pd_all.shape[0])\n",
"\n",
"for label, mood in moods.items(): \n",
" print('微博数目({}):{}'.format(mood, pd_all[pd_all.label==label].shape[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 字段说明\n",
"\n",
"| 字段 | 说明 |\n",
"| ---- | ---- |\n",
"| label | 0 喜悦,1 愤怒,2 厌恶,3 低落 |\n",
"| review | 微博内容 |"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>review</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>307114</th>\n",
" <td>3</td>\n",
" <td>回复美国看起来很美,对别人比较狠!对付哪国人,就用哪国人做他的腿,简称狗腿落后的祖宗挨过打!...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>258815</th>\n",
" <td>2</td>\n",
" <td>我表示压力狠大。!哇。犀利妹!偶尔街拍,其实姐只是一个你永远无法超越的传说。</td>\n",
" </tr>\n",
" <tr>\n",
" <th>249801</th>\n",
" <td>1</td>\n",
" <td>可怜,帮这孩子转下,希望不会因为涉嫌联系业务负什么责任啊…………是想粉丝想疯了什么情况啊?想...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>165587</th>\n",
" <td>0</td>\n",
" <td>哦也~ ~ ~ !得瑟哈哈哈耶~ ~ ~ !新logo 。。。。我们的logo 会不会抢了的...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>351395</th>\n",
" <td>3</td>\n",
" <td>我发现真的是最齐全的一张。这是去看北方儿子的时候啊。怀念。对了,我怎么穿那件破衬衫。。好难看...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>339894</th>\n",
" <td>3</td>\n",
" <td>看你那个享受的表情nuna 很感动~</td>\n",
" </tr>\n",
" <tr>\n",
" <th>307523</th>\n",
" <td>3</td>\n",
" <td>不得不轉 !大家淚 奔吧哈哈27開 始,短短8秒,我咽哽了</td>\n",
" </tr>\n",
" <tr>\n",
" <th>124636</th>\n",
" <td>0</td>\n",
" <td>早看到了,再看到还是想笑,好可爱啊</td>\n",
" </tr>\n",
" <tr>\n",
" <th>56901</th>\n",
" <td>0</td>\n",
" <td>快来围观我的小丸子模板~ ~ 哇咔咔~ 得瑟~ ~ ~</td>\n",
" </tr>\n",
" <tr>\n",
" <th>106905</th>\n",
" <td>0</td>\n",
" <td>也未免太厉害了吧.......观看完此视频之后,我终于明白了香港歌星GEM—— 邓紫棋走红的...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>291966</th>\n",
" <td>2</td>\n",
" <td>天啊…是住家发生爆炸了,天热,各位注意安全。一朋友开化工厂的。唉。注意安全。真难以想像,不知...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>321489</th>\n",
" <td>3</td>\n",
" <td>肯德基你就不会带个头,做件好事可爱的脖子们,帮她圆了梦吧~ ~ 小时候来北京,吃过一种小糕点...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>188566</th>\n",
" <td>0</td>\n",
" <td>想去桂林,上学时候就学到一课文说桂林山水甲天下,一直想去看看品橙网国庆旅游胜地创意评奖活动开...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15444</th>\n",
" <td>0</td>\n",
" <td>晃姐姐口才真不是一般人的高,这大概就是文凭带来的区别吧。拿着真文凭的人总会觉得那是自己的底线...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>56820</th>\n",
" <td>0</td>\n",
" <td>火火happy birthday 天蝎座的人虽然喜欢隐藏自己,但是他喜欢掌握每天生活当中与他...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>257031</th>\n",
" <td>2</td>\n",
" <td>好久没看了。。。还是那么的感动~ ~ ~ ~</td>\n",
" </tr>\n",
" <tr>\n",
" <th>144782</th>\n",
" <td>0</td>\n",
" <td>你看他像几岁?关键是牛尔多大?【分享图片】现场挑战高难度抗衰老奇迹~ 看看他都使用倩碧什么产品~</td>\n",
" </tr>\n",
" <tr>\n",
" <th>130776</th>\n",
" <td>0</td>\n",
" <td>比江苏台的好玩这个真的很搞笑,再次推荐!哈哈,这个绝对值得一看,搞笑死了。当然其中的讽刺意味...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>59158</th>\n",
" <td>0</td>\n",
" <td>【YMG 推荐】来,哥让你见识下,什么是真正的招财猫!要发财的童鞋抱走~ ~ 在海味舖 買 ?</td>\n",
" </tr>\n",
" <tr>\n",
" <th>240262</th>\n",
" <td>1</td>\n",
" <td>该带套的时候要带上。大哥,你就得瑟吧和吃饭。美女很美很火。因为吃香辣小龙虾,我的衬衫歇火了。...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label review\n",
"307114 3 回复美国看起来很美,对别人比较狠!对付哪国人,就用哪国人做他的腿,简称狗腿落后的祖宗挨过打!...\n",
"258815 2 我表示压力狠大。!哇。犀利妹!偶尔街拍,其实姐只是一个你永远无法超越的传说。\n",
"249801 1 可怜,帮这孩子转下,希望不会因为涉嫌联系业务负什么责任啊…………是想粉丝想疯了什么情况啊?想...\n",
"165587 0 哦也~ ~ ~ !得瑟哈哈哈耶~ ~ ~ !新logo 。。。。我们的logo 会不会抢了的...\n",
"351395 3 我发现真的是最齐全的一张。这是去看北方儿子的时候啊。怀念。对了,我怎么穿那件破衬衫。。好难看...\n",
"339894 3 看你那个享受的表情nuna 很感动~\n",
"307523 3 不得不轉 !大家淚 奔吧哈哈27開 始,短短8秒,我咽哽了\n",
"124636 0 早看到了,再看到还是想笑,好可爱啊\n",
"56901 0 快来围观我的小丸子模板~ ~ 哇咔咔~ 得瑟~ ~ ~\n",
"106905 0 也未免太厉害了吧.......观看完此视频之后,我终于明白了香港歌星GEM—— 邓紫棋走红的...\n",
"291966 2 天啊…是住家发生爆炸了,天热,各位注意安全。一朋友开化工厂的。唉。注意安全。真难以想像,不知...\n",
"321489 3 肯德基你就不会带个头,做件好事可爱的脖子们,帮她圆了梦吧~ ~ 小时候来北京,吃过一种小糕点...\n",
"188566 0 想去桂林,上学时候就学到一课文说桂林山水甲天下,一直想去看看品橙网国庆旅游胜地创意评奖活动开...\n",
"15444 0 晃姐姐口才真不是一般人的高,这大概就是文凭带来的区别吧。拿着真文凭的人总会觉得那是自己的底线...\n",
"56820 0 火火happy birthday 天蝎座的人虽然喜欢隐藏自己,但是他喜欢掌握每天生活当中与他...\n",
"257031 2 好久没看了。。。还是那么的感动~ ~ ~ ~\n",
"144782 0 你看他像几岁?关键是牛尔多大?【分享图片】现场挑战高难度抗衰老奇迹~ 看看他都使用倩碧什么产品~\n",
"130776 0 比江苏台的好玩这个真的很搞笑,再次推荐!哈哈,这个绝对值得一看,搞笑死了。当然其中的讽刺意味...\n",
"59158 0 【YMG 推荐】来,哥让你见识下,什么是真正的招财猫!要发财的童鞋抱走~ ~ 在海味舖 買 ?\n",
"240262 1 该带套的时候要带上。大哥,你就得瑟吧和吃饭。美女很美很火。因为吃香辣小龙虾,我的衬衫歇火了。..."
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd_all.sample(20)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}