intro.ipynb 11.7 KB
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# simplifyweibo_4_moods 说明\n",
    "0. **下载地址:** [百度网盘](https://pan.baidu.com/s/16c93E5x373nsGozyWevITg)\n",
    "1. **数据概览:** 36 万多条,带情感标注 新浪微博,包含 4 种情感,其中喜悦约 20 万条,愤怒、厌恶、低落各约 5 万条\n",
    "2. **推荐实验:** 情感/观点/评论 倾向性分析\n",
    "2. **数据来源:** [新浪微博](https://weibo.com/)\n",
    "3. **原数据集:** [微博情感分析数据集](https://download.csdn.net/download/turkan/9181661),网上搜集,具体作者、来源不详\n",
    "4. **加工处理:**\n",
    "    1. 将原来的 4 份文档,整合成 1 份 csv 文件\n",
    "    2. 原始语料进行了分词处理,我们重新将其还原为未分词的语料\n",
    "    3. 编码统一为 UTF-8\n",
    "    4. 去重"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = 'simplifyweibo_4_moods_文件夹_所在_路径'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. simplifyweibo_4_moods.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 加载数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "微博数目(总体):361744\n",
      "微博数目(喜悦):199496\n",
      "微博数目(愤怒):51714\n",
      "微博数目(厌恶):55267\n",
      "微博数目(低落):55267\n"
     ]
    }
   ],
   "source": [
    "pd_all = pd.read_csv(path + 'simplifyweibo_4_moods.csv')\n",
    "moods = {0: '喜悦', 1: '愤怒', 2: '厌恶', 3: '低落'}\n",
    "\n",
    "print('微博数目(总体):%d' % pd_all.shape[0])\n",
    "\n",
    "for label, mood in moods.items(): \n",
    "    print('微博数目({}):{}'.format(mood,  pd_all[pd_all.label==label].shape[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 字段说明\n",
    "\n",
    "| 字段 | 说明 |\n",
    "| ---- | ---- |\n",
    "| label | 0 喜悦,1 愤怒,2 厌恶,3 低落 |\n",
    "| review | 微博内容 |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>review</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>307114</th>\n",
       "      <td>3</td>\n",
       "      <td>回复美国看起来很美,对别人比较狠!对付哪国人,就用哪国人做他的腿,简称狗腿落后的祖宗挨过打!...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>258815</th>\n",
       "      <td>2</td>\n",
       "      <td>我表示压力狠大。!哇。犀利妹!偶尔街拍,其实姐只是一个你永远无法超越的传说。</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>249801</th>\n",
       "      <td>1</td>\n",
       "      <td>可怜,帮这孩子转下,希望不会因为涉嫌联系业务负什么责任啊…………是想粉丝想疯了什么情况啊?想...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>165587</th>\n",
       "      <td>0</td>\n",
       "      <td>哦也~ ~ ~ !得瑟哈哈哈耶~ ~ ~ !新logo 。。。。我们的logo 会不会抢了的...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>351395</th>\n",
       "      <td>3</td>\n",
       "      <td>我发现真的是最齐全的一张。这是去看北方儿子的时候啊。怀念。对了,我怎么穿那件破衬衫。。好难看...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>339894</th>\n",
       "      <td>3</td>\n",
       "      <td>看你那个享受的表情nuna 很感动~</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>307523</th>\n",
       "      <td>3</td>\n",
       "      <td>不得不轉 !大家淚 奔吧哈哈27開 始,短短8秒,我咽哽了</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>124636</th>\n",
       "      <td>0</td>\n",
       "      <td>早看到了,再看到还是想笑,好可爱啊</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>56901</th>\n",
       "      <td>0</td>\n",
       "      <td>快来围观我的小丸子模板~ ~ 哇咔咔~ 得瑟~ ~ ~</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>106905</th>\n",
       "      <td>0</td>\n",
       "      <td>也未免太厉害了吧.......观看完此视频之后,我终于明白了香港歌星GEM—— 邓紫棋走红的...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>291966</th>\n",
       "      <td>2</td>\n",
       "      <td>天啊…是住家发生爆炸了,天热,各位注意安全。一朋友开化工厂的。唉。注意安全。真难以想像,不知...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>321489</th>\n",
       "      <td>3</td>\n",
       "      <td>肯德基你就不会带个头,做件好事可爱的脖子们,帮她圆了梦吧~ ~ 小时候来北京,吃过一种小糕点...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>188566</th>\n",
       "      <td>0</td>\n",
       "      <td>想去桂林,上学时候就学到一课文说桂林山水甲天下,一直想去看看品橙网国庆旅游胜地创意评奖活动开...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15444</th>\n",
       "      <td>0</td>\n",
       "      <td>晃姐姐口才真不是一般人的高,这大概就是文凭带来的区别吧。拿着真文凭的人总会觉得那是自己的底线...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>56820</th>\n",
       "      <td>0</td>\n",
       "      <td>火火happy birthday 天蝎座的人虽然喜欢隐藏自己,但是他喜欢掌握每天生活当中与他...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>257031</th>\n",
       "      <td>2</td>\n",
       "      <td>好久没看了。。。还是那么的感动~ ~ ~ ~</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>144782</th>\n",
       "      <td>0</td>\n",
       "      <td>你看他像几岁?关键是牛尔多大?【分享图片】现场挑战高难度抗衰老奇迹~ 看看他都使用倩碧什么产品~</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>130776</th>\n",
       "      <td>0</td>\n",
       "      <td>比江苏台的好玩这个真的很搞笑,再次推荐!哈哈,这个绝对值得一看,搞笑死了。当然其中的讽刺意味...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>59158</th>\n",
       "      <td>0</td>\n",
       "      <td>【YMG 推荐】来,哥让你见识下,什么是真正的招财猫!要发财的童鞋抱走~ ~ 在海味舖 買 ?</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>240262</th>\n",
       "      <td>1</td>\n",
       "      <td>该带套的时候要带上。大哥,你就得瑟吧和吃饭。美女很美很火。因为吃香辣小龙虾,我的衬衫歇火了。...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        label                                             review\n",
       "307114      3  回复美国看起来很美,对别人比较狠!对付哪国人,就用哪国人做他的腿,简称狗腿落后的祖宗挨过打!...\n",
       "258815      2             我表示压力狠大。!哇。犀利妹!偶尔街拍,其实姐只是一个你永远无法超越的传说。\n",
       "249801      1  可怜,帮这孩子转下,希望不会因为涉嫌联系业务负什么责任啊…………是想粉丝想疯了什么情况啊?想...\n",
       "165587      0  哦也~ ~ ~ !得瑟哈哈哈耶~ ~ ~ !新logo 。。。。我们的logo 会不会抢了的...\n",
       "351395      3  我发现真的是最齐全的一张。这是去看北方儿子的时候啊。怀念。对了,我怎么穿那件破衬衫。。好难看...\n",
       "339894      3                                 看你那个享受的表情nuna 很感动~\n",
       "307523      3                      不得不轉 !大家淚 奔吧哈哈27開 始,短短8秒,我咽哽了\n",
       "124636      0                                  早看到了,再看到还是想笑,好可爱啊\n",
       "56901       0                        快来围观我的小丸子模板~ ~ 哇咔咔~ 得瑟~ ~ ~\n",
       "106905      0  也未免太厉害了吧.......观看完此视频之后,我终于明白了香港歌星GEM—— 邓紫棋走红的...\n",
       "291966      2  天啊…是住家发生爆炸了,天热,各位注意安全。一朋友开化工厂的。唉。注意安全。真难以想像,不知...\n",
       "321489      3  肯德基你就不会带个头,做件好事可爱的脖子们,帮她圆了梦吧~ ~ 小时候来北京,吃过一种小糕点...\n",
       "188566      0  想去桂林,上学时候就学到一课文说桂林山水甲天下,一直想去看看品橙网国庆旅游胜地创意评奖活动开...\n",
       "15444       0  晃姐姐口才真不是一般人的高,这大概就是文凭带来的区别吧。拿着真文凭的人总会觉得那是自己的底线...\n",
       "56820       0  火火happy birthday 天蝎座的人虽然喜欢隐藏自己,但是他喜欢掌握每天生活当中与他...\n",
       "257031      2                             好久没看了。。。还是那么的感动~ ~ ~ ~\n",
       "144782      0   你看他像几岁?关键是牛尔多大?【分享图片】现场挑战高难度抗衰老奇迹~ 看看他都使用倩碧什么产品~\n",
       "130776      0  比江苏台的好玩这个真的很搞笑,再次推荐!哈哈,这个绝对值得一看,搞笑死了。当然其中的讽刺意味...\n",
       "59158       0    【YMG 推荐】来,哥让你见识下,什么是真正的招财猫!要发财的童鞋抱走~ ~ 在海味舖 買 ?\n",
       "240262      1  该带套的时候要带上。大哥,你就得瑟吧和吃饭。美女很美很火。因为吃香辣小龙虾,我的衬衫歇火了。..."
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd_all.sample(20)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}