Weakly supervised learning

Labeling data is often too expensive, so the available data lack fully correct supervision.

Weakly supervised learning can be divided into three categories [1]:

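Incomplete supervision: only a (usually small) subset of the dataset is labeled; most of the data are unlabeled.
Inexact supervision: only coarse-grained labels are available.
Inaccurate supervision: the given labels are not always the ground truth.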

Learning with incomplete supervision

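Semi-supervised learning [2,3]:
- Similar examples have similar class labels; this idea underlies both the cluster assumption and the manifold assumption.
- Representative methods: semi-supervised SVMs, graph-based semi-supervised learning, and co-training.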
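
To make the cluster assumption concrete, here is a minimal sketch of graph-based label propagation in plain Python. It is an illustration only, not the method of any cited paper; the function name `label_propagation`, the RBF width `sigma`, and the toy 1-D data are assumptions made for this example.

```python
import math

def label_propagation(points, labels, sigma=1.0, iters=100):
    """Propagate binary labels over an RBF similarity graph.

    points: list of floats (1-D features, for simplicity).
    labels: list with 0/1 for labeled points and None for unlabeled ones.
    Returns a hard 0/1 label for every point.
    """
    n = len(points)
    # Edge weights: nearby points are strongly connected (cluster assumption).
    w = [[math.exp(-((points[i] - points[j]) ** 2) / (2 * sigma ** 2)) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    # f[i] is a soft score for class 1; unlabeled points start undecided.
    f = [0.5 if y is None else float(y) for y in labels]
    for _ in range(iters):
        new_f = []
        for i in range(n):
            if labels[i] is not None:
                new_f.append(float(labels[i]))  # clamp the known labels
            else:
                # Weighted average of the neighbors' current scores.
                new_f.append(sum(w[i][j] * f[j] for j in range(n)) / sum(w[i]))
        f = new_f
    return [int(score >= 0.5) for score in f]

# Two well-separated clusters, one labeled point in each.
points = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4]
labels = [0, None, None, 1, None, None]
print(label_propagation(points, labels))  # [0, 0, 0, 1, 1, 1]
```

Each unlabeled point ends up with the label of the cluster it sits in, because the weights between the two clusters are negligible.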
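Active learning: the learner may query the true labels of a small number of unlabeled examples to improve performance.
- The goal is to issue as few queries as possible while making each one as useful as possible, typically by picking the most uncertain or the most diverse examples.
  - Uncertain examples: carry a lot of information; querying them reduces the model's uncertainty.
  - Diverse examples: non-redundant, so together they describe the whole dataset more completely.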
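
The two query criteria above can be sketched as follows. This is a hedged illustration: using binary entropy as the uncertainty measure and greedy farthest-first selection as the diversity measure are common choices assumed here, not something these notes prescribe, and the function names are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy (in nats) of a predicted positive-class probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def most_uncertain(probs, k=1):
    """Indices of the k unlabeled examples whose predictions are least certain."""
    order = sorted(range(len(probs)), key=lambda i: binary_entropy(probs[i]), reverse=True)
    return order[:k]

def farthest_first(points, k):
    """Greedy diversity selection on 1-D points: repeatedly add the point
    that is farthest from everything selected so far."""
    k = min(k, len(points))
    selected = [0]  # start from an arbitrary point (here: the first one)
    while len(selected) < k:
        best = max(
            (i for i in range(len(points)) if i not in selected),
            key=lambda i: min(abs(points[i] - points[s]) for s in selected),
        )
        selected.append(best)
    return selected

print(most_uncertain([0.95, 0.52, 0.10, 0.70], k=2))   # [1, 3]
print(farthest_first([0.0, 0.1, 5.0, 5.1, 10.0], k=3))  # [0, 4, 2]
```

The uncertainty criterion picks predictions closest to 0.5, while farthest-first skips near-duplicates such as 0.1 and 5.1.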
PU learning (positive-unlabeled learning) [4,5]: only positive examples are labeled, because negative labels are too expensive to obtain, as with clinical data.
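
As a rough sketch of how a classifier can be trained without negative labels, the code below evaluates a non-negative PU risk in the spirit of the estimator in [5]: the negative-class risk is estimated from unlabeled data minus the positive contribution, and clipped at zero. The sigmoid loss, the assumption that the class prior `prior` is known, and the function names are choices made for this example.

```python
import math

def sigmoid_loss(score, y):
    """Sigmoid loss for a real-valued score and a label y in {+1, -1}."""
    return 1.0 / (1.0 + math.exp(y * score))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk.

    pos_scores: classifier scores on labeled positive examples.
    unl_scores: classifier scores on unlabeled examples.
    prior:      assumed known fraction of positives in the unlabeled data.
    """
    risk_pos_as_pos = mean(sigmoid_loss(s, +1) for s in pos_scores)
    risk_pos_as_neg = mean(sigmoid_loss(s, -1) for s in pos_scores)
    risk_unl_as_neg = mean(sigmoid_loss(s, -1) for s in unl_scores)
    # Estimated negative-class risk; clipped at zero so it cannot go negative.
    neg_risk = max(0.0, risk_unl_as_neg - prior * risk_pos_as_neg)
    return prior * risk_pos_as_pos + neg_risk

print(nn_pu_risk([2.0, 1.5, 0.3], [1.0, -0.5, -2.0, -1.2], prior=0.3))
```

Without the `max(0.0, ...)` clip, a flexible model can drive the estimated negative risk below zero, which is a sign of overfitting rather than of a good classifier.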

Learning with inexact supervision

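Multi-instance learning: several instances together form a bag. A bag is positive if it contains at least one positive instance, but which instance or instances are positive is unknown.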
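
The bag-labeling rule can be written down directly. In the sketch below, `bag_label` encodes the definition, and `bag_score` shows one common way to turn instance-level scores into a bag-level score (max pooling); the pooling choice and both function names are assumptions for illustration.

```python
def bag_label(instance_labels):
    """A bag is positive (1) if at least one of its instances is positive."""
    return int(any(instance_labels))

def bag_score(instance_scores):
    """Max pooling: the bag is as positive as its most positive instance."""
    return max(instance_scores)

print(bag_label([0, 0, 1]))        # 1
print(bag_label([0, 0, 0]))        # 0
print(bag_score([0.1, 0.8, 0.3]))  # 0.8
```

During training only the bag label is observed, so the per-instance labels passed to `bag_label` here would be hidden in practice.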
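Partial-label learning: also called ambiguous-label learning or superset-label learning. Each example is associated with a set of candidate labels, exactly one of which describes its true semantics, and that correct label is unknown during training.
- Identification-based strategy: also called disambiguation. During training it heuristically purifies each candidate label set, extracting the correct label to remove the inexactness that the other candidates introduce.
  - The candidate labels compete with one another: each label influences the learning objective to a different degree, and as one label's influence grows, the others' must shrink.
  - Algorithms of this kind alternate between scoring each candidate label and optimizing a specially designed criterion. A candidate's score expresses how likely it is to be the correct label, and the highest-scoring candidate at the end of training is taken as correct. The estimated class posterior probability is commonly used as the score.
  - The correct label is treated as a latent variable; the objective directly maximizes the model output on a single candidate label, which is then taken as the correct one.
  - Representative approaches: maximum likelihood, maximum margin, and deep learning.
- Average-based strategy: all candidate labels are treated as equivalent and contribute to the learning objective cooperatively, so the hidden correct label never has to be identified. During training every candidate label is treated alike and has the same influence on the objective.
- Complementary-label learning: a complementary label specifies one class that the example definitely does not belong to. It can therefore be seen as an extreme case of a partial label in which each example has (number of classes - 1) candidate labels.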
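
The difference between the two strategies shows up clearly in how a loss is computed from predicted class probabilities. The following is a minimal sketch under assumed definitions (negative log-likelihood as the base loss, hypothetical function names); it is not the formulation of any specific published algorithm.

```python
import math

def average_based_loss(probs, candidates):
    """Average-based: every candidate label contributes equally to the loss."""
    return -sum(math.log(probs[c]) for c in candidates) / len(candidates)

def identification_based_loss(probs, candidates):
    """Identification-based: take the highest-scoring candidate as the guessed
    true label and compute the loss on that label only."""
    guess = max(candidates, key=lambda c: probs[c])
    return guess, -math.log(probs[guess])

def candidates_from_complementary(num_classes, complementary_label):
    """A complementary label rules out one class; all others remain candidates."""
    return [c for c in range(num_classes) if c != complementary_label]

probs = [0.7, 0.2, 0.1]   # predicted class posteriors for one example
candidates = [0, 1]       # candidate label set; the true label is one of these

print(average_based_loss(probs, candidates))
print(identification_based_loss(probs, candidates))
print(candidates_from_complementary(3, 2))  # [0, 1]
```

In an actual identification-based method the guess would be refreshed as the model is retrained, which gives the alternation between scoring and optimization described above.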
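Missing-label learning [9,10]: some of the correspondences between examples and labels are missing. There is no limit on how many are missing, nor on whether the missing ones are relevant or irrelevant labels; in other words, what is missing are example-label pairs.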
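- Low-rank assumption [11,12]: the matrix formed by the labels of all examples is low-rank. That is, if the label space contains q labels, the label combinations that actually occur are far fewer than 2^q, which means that some labels are strongly correlated while others are mutually exclusive.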
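  - Mapping models [14,15]: the recovered label matrix is low-rank and must also satisfy a mapping from the feature matrix.
  - Relation models [16]: assume some relational model between the missing and the observed labels; project the original label space onto a low-rank space and then infer the model parameters.
- Manifold assumption: examples with similar features have similar labels, so information from the feature space is used to fill in what is missing in the label space.
- Low-rank + manifold [17,18]
- Model optimization [19,20]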
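
To illustrate why a low-rank label matrix lets missing entries be recovered, here is a toy rank-1 completion by alternating least squares. This is a deliberately simplified stand-in: the cited works use different formulations (for example, convex matrix completion in [12]), and the function name `complete_rank1` and the example matrix are assumptions.

```python
def complete_rank1(Y, iters=50):
    """Fit Y[i][j] ~ u[i] * v[j] on the observed entries only.

    Y: list of rows with entries 0/1, or None where the label is missing.
    Returns a dense matrix of scores u[i] * v[j].
    """
    n, q = len(Y), len(Y[0])
    u, v = [1.0] * n, [1.0] * q
    for _ in range(iters):
        # Update each row factor with the column factors held fixed.
        for i in range(n):
            obs = [j for j in range(q) if Y[i][j] is not None]
            den = sum(v[j] ** 2 for j in obs)
            u[i] = sum(Y[i][j] * v[j] for j in obs) / den if den > 0 else 0.0
        # Update each column factor with the row factors held fixed.
        for j in range(q):
            obs = [i for i in range(n) if Y[i][j] is not None]
            den = sum(u[i] ** 2 for i in obs)
            v[j] = sum(Y[i][j] * u[i] for i in obs) / den if den > 0 else 0.0
    return [[u[i] * v[j] for j in range(q)] for i in range(n)]

# Labels 0 and 1 always co-occur, label 2 never appears: a rank-1 label matrix.
Y = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, None, 0],  # the second label of the last example is missing
]
scores = complete_rank1(Y)
print(round(scores[3][1], 3))  # close to 1: the missing label is recovered as relevant
```

The missing entry is filled in because the other rows show that the first two labels are perfectly correlated, which is exactly the kind of structure the low-rank assumption encodes.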

Learning with inaccurate supervision

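Noisy-label learning:
- Label correction [6,7]: correct wrong labels through a clean-label inference step.
- Loss correction [8]: modify the loss function to make it more robust to wrong labels.
- Improved training strategies: MentorNet & StudentNet, co-teaching.
- Noise-robust loss functions.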
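
As an example of loss correction, the sketch below implements a forward-corrected cross-entropy in the style of [8]: the model's clean-class probabilities are pushed through a noise transition matrix before being compared with the noisy label. The matrix `T` is assumed to be known here (in practice it must be estimated), and the function name is hypothetical.

```python
import math

def forward_corrected_loss(probs, noisy_label, T):
    """Cross-entropy against the noisy label after applying the noise model.

    probs:       predicted probabilities of the clean classes.
    noisy_label: the observed, possibly wrong, label.
    T:           T[i][j] = P(noisy label = j | true label = i), assumed known.
    """
    # Probability that the observed label would be `noisy_label`.
    noisy_prob = sum(T[i][noisy_label] * probs[i] for i in range(len(probs)))
    return -math.log(noisy_prob)

probs = [0.9, 0.1]                 # the model is fairly sure the true class is 0
T = [[0.8, 0.2],                   # 20% symmetric label noise
     [0.2, 0.8]]

plain = -math.log(probs[1])        # uncorrected loss for observed label 1
corrected = forward_corrected_loss(probs, 1, T)
print(plain, corrected)
```

The corrected loss is smaller than the plain one for this example, so a confident prediction that disagrees with a possibly flipped label is penalized less.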

References:
[1] Zhou, Zhihua. A brief introduction to weakly supervised learning[J]. National Science Review, 2017, 5(1):44–53.
[2] Zhu, Xiaojin J. Semi-supervised learning literature survey[J]. University of Wisconsin-Madison Department of Computer Sciences, 2005.
[3] Chapelle, Olivier, Scholkopf, Bernhard, and Zien, Alexander. Semi-supervised learning[J]. IEEE Transactions on Neural Networks, 2009, 20(3):542–542.
[4] du Plessis, Marthinus C, Niu, Gang, and Sugiyama, Masashi. Analysis of learning from positive and unlabeled data[C]. In: Advances in neural information processing systems. 2014. 703–711.
[5] Kiryo, Ryuichi, Niu, Gang, du Plessis, Marthinus C, et al. Positive-unlabeled learning with non-negative risk estimator[C]. In: Advances in Neural Information Processing Systems. 2017. 1675–1685.
[6] Xiao, Tong, Xia, Tian, Yang, Yi, et al. Learning from massive noisy labeled data for image classification[C]. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. 2691–2699.
[7] Veit, Andreas, Alldrin, Neil, Chechik, Gal, et al. Learning from noisy large-scale datasets with minimal supervision[C]. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. 839–847.
[8] Patrini, Giorgio, Rozza, Alessandro, Menon, Aditya Krishna, et al. Making deep neural networks robust to label noise: A loss correction approach[C]. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. 1944–1952.
[9] Yu, Hsiang-Fu, Jain, Prateek, Kar, Purushottam, et al. Large-scale multi-label learning with missing labels[C]. In: Proceedings of the International Conference on Machine Learning. 2014. 593–601.
[10] Xu, Miao, Jin, Rong, and Zhou, Zhihua. CUR algorithm for partially observed matrices[C]. In: International Conference on Machine Learning. 2015. 1412–1421.
[11] Goldberg, Andrew, Recht, Ben, Xu, Junming, et al. Transduction with matrix completion: Three birds with one stone[J]. Advances in neural information processing systems, 2010, pages 757–765.
[12] Candes, Emmanuel J. and Recht, Benjamin. Exact matrix completion via convex optimization[J]. Foundations of Computational mathematics, 2009, 9(6):717–772.
[13] Jing, Liping, Yang, Liu, Yu, Jian, et al. Semi-supervised low-rank mapping learning for multi-label classification[C]. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. 1483–1491.
[14] Yu, Hsiang-Fu, Jain, Prateek, Kar, Purushottam, et al. Large-scale multi-label learning with missing labels[C]. In: Proceedings of the International Conference on Machine Learning. 2014. 593–601.
[15] Xu, Linli, Wang, Zhen, Shen, Zefan, et al. Learning low-rank label correlations for multi-label classification with missing labels[C]. In: Proceedings of the IEEE International Conference on Data Mining. 2014. 1067–1072.
[16] Bi, Wei and Kwok, James. Multilabel classification with label correlations and missing labels[C]. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2014. 1680–1686.
[17] Jing, Liping, Yang, Liu, Yu, Jian, et al. Semi-supervised low-rank mapping learning for multi-label classification[C]. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. 1483–1491.
[18] Zhao, Feipeng and Guo, Yuhong. Semi-supervised multi-label learning with incomplete labels[C]. In: Proceedings of the International Joint Conference on Artificial Intelligence. 2015. 4062–4068.
[19] Xu, Miao, Jin, Rong, and Zhou, Zhihua. CUR algorithm for partially observed matrices[C]. In: International Conference on Machine Learning. 2015. 1412–1421.
[20] Xu, Miao, Jin, Rong, and Zhou, Zhihua. Speedup matrix completion with side information: Application to multi-label learning[C]. In: Advances in neural information processing systems. 2013. 2301–2309.