{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
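"# About waimai_10k\n",
"0. **Download:** [Github](https://github.com/SophonPlus/ChineseNlpCorpus/raw/master/datasets/waimai_10k/waimai_10k.csv)\n",
"1. **Overview:** User reviews collected from a food-delivery platform: 4,000 positive and about 8,000 negative\n",
"2. **Suggested experiments:** Sentiment / opinion / review polarity analysis\n",
"3. **Data source:** A food-delivery platform\n",
"4. **Original dataset:** [中文短文本情感分析语料 外卖评价](https://download.csdn.net/download/cstkl/10236683) (Chinese short-text sentiment analysis corpus, food-delivery reviews), collected from the web; the exact author and origin are unknown\n",
"5. **Processing:**\n",
"    1. Merged the original 2 files into 1 file\n",
"    2. Removed duplicates"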
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"path = 'path_to_waimai_10k_folder/'  # placeholder; must end with a path separator"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. waimai_10k.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of reviews (total): 11987\n",
"Number of reviews (positive): 4000\n",
"Number of reviews (negative): 7987\n"
]
}
],
"source": [
"pd_all = pd.read_csv(path + 'waimai_10k.csv')\n",
"\n",
"print('Number of reviews (total): %d' % pd_all.shape[0])\n",
"print('Number of reviews (positive): %d' % pd_all[pd_all.label==1].shape[0])\n",
"print('Number of reviews (negative): %d' % pd_all[pd_all.label==0].shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Field description\n",
"\n",
"| Field | Description |\n",
"| ---- | ---- |\n",
"| label | 1 for a positive review, 0 for a negative review |\n",
"| review | Review text |"
]
},
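{
"cell_type": "markdown",
"metadata": {},
"source": [
"The field table above can be checked with a few assertions. The cell below is a sketch that is not part of the original notebook: it runs on a tiny hypothetical frame `toy`, so it works without the CSV. To check the real data, replace `toy` with `pd_all`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy frame standing in for pd_all (the real data comes from waimai_10k.csv).\n",
"toy = pd.DataFrame({'label': [1, 0, 0],\n",
"                    'review': ['非常划算,很好', '没有满减!', '送的太慢了,菜都凉了。']})\n",
"\n",
"# Both documented columns are present, in the documented order.\n",
"assert list(toy.columns) == ['label', 'review']\n",
"# Labels take only the two documented values.\n",
"assert set(toy['label'].unique()) <= {0, 1}\n",
"# Count fully duplicated rows (the notes say the corpus was de-duplicated).\n",
"print('Duplicate rows: %d' % toy.duplicated().sum())"
]
},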
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>review</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>1</td>\n",
" <td>送餐特别快,态度也好,辛苦啦</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6632</th>\n",
" <td>0</td>\n",
" <td>点了热带雨林披萨+饮料,和BBQ鸡肉披萨+饮料,送来的是两个奥尔良披萨+两个银耳冰粥,冰凉冰...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8849</th>\n",
" <td>0</td>\n",
" <td>难吃!!!油死了,味道烂</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11114</th>\n",
" <td>0</td>\n",
" <td>今天菜太咸,连着定了3天吃,一天比一天难吃。</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11661</th>\n",
" <td>0</td>\n",
" <td>送的太慢了,菜都凉了。</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9571</th>\n",
" <td>0</td>\n",
" <td>没有满减!</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10614</th>\n",
" <td>0</td>\n",
" <td>差评!定的时间是12点一刻,结果刚11点就送来了!果断退单。送餐前不看时间吗?</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7585</th>\n",
" <td>0</td>\n",
" <td>羊肉串太咸,还有些不新鲜。鸡心和鸡胗烤的太老</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6919</th>\n",
" <td>0</td>\n",
" <td>快递员挺好,速度挺快</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3192</th>\n",
" <td>1</td>\n",
" <td>小炒肉卷饼好辣~</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10224</th>\n",
" <td>0</td>\n",
" <td>送来的时候都凉了,味道一般,鲜果西米露就两口的量,鲜果就是一块西瓜一个西瓜籽</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7295</th>\n",
" <td>0</td>\n",
" <td>没放糖,没放奶油,好难喝</td>\n",
" </tr>\n",
" <tr>\n",
" <th>275</th>\n",
" <td>1</td>\n",
" <td>他家的奶茶超级好喝。。。</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8378</th>\n",
" <td>0</td>\n",
" <td>黑椒牛柳饭送成大排饭</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5879</th>\n",
" <td>0</td>\n",
" <td>一个半小时,可以</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7523</th>\n",
" <td>0</td>\n",
" <td>订单满减后应该是24,送过来要收我原价39?你搞笑呐,还少听加多宝!我管你什么美食送的还是你...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6590</th>\n",
" <td>0</td>\n",
" <td>真心也忒慢了,其他都还成</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1703</th>\n",
" <td>1</td>\n",
" <td>非常划算,很好</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5345</th>\n",
" <td>0</td>\n",
" <td>首选是得吐槽一下这家的速度,一个半小时起,然后卷饼包装很不错,酱香鸡肉的比较赞,飘香肘子一般...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1674</th>\n",
" <td>1</td>\n",
" <td>离我们远点55分钟送到的,可以理解,饼和粥都不错</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label review\n",
"25 1 送餐特别快,态度也好,辛苦啦\n",
"6632 0 点了热带雨林披萨+饮料,和BBQ鸡肉披萨+饮料,送来的是两个奥尔良披萨+两个银耳冰粥,冰凉冰...\n",
"8849 0 难吃!!!油死了,味道烂\n",
"11114 0 今天菜太咸,连着定了3天吃,一天比一天难吃。\n",
"11661 0 送的太慢了,菜都凉了。\n",
"9571 0 没有满减!\n",
"10614 0 差评!定的时间是12点一刻,结果刚11点就送来了!果断退单。送餐前不看时间吗?\n",
"7585 0 羊肉串太咸,还有些不新鲜。鸡心和鸡胗烤的太老\n",
"6919 0 快递员挺好,速度挺快\n",
"3192 1 小炒肉卷饼好辣~\n",
"10224 0 送来的时候都凉了,味道一般,鲜果西米露就两口的量,鲜果就是一块西瓜一个西瓜籽\n",
"7295 0 没放糖,没放奶油,好难喝\n",
"275 1 他家的奶茶超级好喝。。。\n",
"8378 0 黑椒牛柳饭送成大排饭\n",
"5879 0 一个半小时,可以\n",
"7523 0 订单满减后应该是24,送过来要收我原价39?你搞笑呐,还少听加多宝!我管你什么美食送的还是你...\n",
"6590 0 真心也忒慢了,其他都还成\n",
"1703 1 非常划算,很好\n",
"5345 0 首选是得吐槽一下这家的速度,一个半小时起,然后卷饼包装很不错,酱香鸡肉的比较赞,飘香肘子一般...\n",
"1674 1 离我们远点55分钟送到的,可以理解,饼和粥都不错"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd_all.sample(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Build a balanced corpus"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"pd_positive = pd_all[pd_all.label==1]\n",
"pd_negative = pd_all[pd_all.label==0]\n",
"\n",
"def get_balance_corpus(corpus_size, corpus_pos, corpus_neg):\n",
"    # Draw half of the corpus from each class; sample with replacement only\n",
"    # when a class has fewer rows than requested.\n",
"    sample_size = corpus_size // 2\n",
"    pd_corpus_balance = pd.concat([corpus_pos.sample(sample_size, replace=corpus_pos.shape[0]<sample_size),\n",
"                                   corpus_neg.sample(sample_size, replace=corpus_neg.shape[0]<sample_size)])\n",
"\n",
"    print('Number of reviews (total): %d' % pd_corpus_balance.shape[0])\n",
"    print('Number of reviews (positive): %d' % pd_corpus_balance[pd_corpus_balance.label==1].shape[0])\n",
"    print('Number of reviews (negative): %d' % pd_corpus_balance[pd_corpus_balance.label==0].shape[0])\n",
"\n",
"    return pd_corpus_balance"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of reviews (total): 4000\n",
"Number of reviews (positive): 2000\n",
"Number of reviews (negative): 2000\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>review</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>10436</th>\n",
" <td>0</td>\n",
" <td>难吃~石锅拌饭居然没酱~而且刚好晚了29分钟</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10468</th>\n",
" <td>0</td>\n",
" <td>等了很久,没关系,毕竟还在约定时间内,可是最让我忍不了的是真的很一般,个人口味吧,反正不和我...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1643</th>\n",
" <td>1</td>\n",
" <td>嗯,纸袋比较高大上</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8723</th>\n",
" <td>0</td>\n",
" <td>海参怎么是生的,没法吃,郁闷</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2431</th>\n",
" <td>1</td>\n",
" <td>送餐很快,送餐人员很热情!~</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5121</th>\n",
" <td>0</td>\n",
" <td>不如以前好吃,肘子都有味儿了!哎!</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10565</th>\n",
" <td>0</td>\n",
" <td>东西有些小贵。</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2413</th>\n",
" <td>1</td>\n",
" <td>虽然时间长了些但是很准时。下次记得给些番茄酱就更好了。,一个人吃足够了。好好吃</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11937</th>\n",
" <td>0</td>\n",
" <td>11点以前就定的餐,做了1小时48分钟,呵呵,我只想说:拜拜!!!</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1024</th>\n",
" <td>1</td>\n",
" <td>很好吃,面皮特别有嚼劲儿,酱料也很好吃</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" label review\n",
"10436 0 难吃~石锅拌饭居然没酱~而且刚好晚了29分钟\n",
"10468 0 等了很久,没关系,毕竟还在约定时间内,可是最让我忍不了的是真的很一般,个人口味吧,反正不和我...\n",
"1643 1 嗯,纸袋比较高大上\n",
"8723 0 海参怎么是生的,没法吃,郁闷\n",
"2431 1 送餐很快,送餐人员很热情!~\n",
"5121 0 不如以前好吃,肘子都有味儿了!哎!\n",
"10565 0 东西有些小贵。\n",
"2413 1 虽然时间长了些但是很准时。下次记得给些番茄酱就更好了。,一个人吃足够了。好好吃\n",
"11937 0 11点以前就定的餐,做了1小时48分钟,呵呵,我只想说:拜拜!!!\n",
"1024 1 很好吃,面皮特别有嚼劲儿,酱料也很好吃"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"waimai_10k_ba_4000 = get_balance_corpus(4000, pd_positive, pd_negative)\n",
"\n",
"waimai_10k_ba_4000.sample(10)"
]
}
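,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`get_balance_corpus` draws a different random subset on every run, because `DataFrame.sample` is called without a seed. When a repeatable subset is needed, passing `random_state` is one option. The cell below is a minimal sketch, assuming a hypothetical helper `sample_balanced` and toy data in place of the real corpus; neither is part of the original notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"def sample_balanced(corpus_pos, corpus_neg, corpus_size, seed=0):\n",
"    # Hypothetical seeded variant of get_balance_corpus.\n",
"    half = corpus_size // 2\n",
"    parts = [\n",
"        # Sample with replacement only when a class has fewer rows than requested.\n",
"        df.sample(half, replace=len(df) < half, random_state=seed)\n",
"        for df in (corpus_pos, corpus_neg)\n",
"    ]\n",
"    return pd.concat(parts)\n",
"\n",
"# Toy data standing in for pd_positive / pd_negative.\n",
"toy = pd.DataFrame({'label': [1, 1, 0, 0, 0], 'review': ['a', 'b', 'c', 'd', 'e']})\n",
"first = sample_balanced(toy[toy.label == 1], toy[toy.label == 0], 4, seed=42)\n",
"second = sample_balanced(toy[toy.label == 1], toy[toy.label == 0], 4, seed=42)\n",
"\n",
"# The same seed selects the same rows, and the two classes are equally represented.\n",
"assert first.equals(second)\n",
"assert (first.label == 1).sum() == (first.label == 0).sum() == 2"
]
}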
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}