Li Yancui, Guo Pengcheng, Miao Guoyi. Integration of Bilingual Information for Nuclearity Recognition in Chinese Discourse[J]. Journal of Nanjing Normal University (Natural Science Edition), 2026, 49(02): 74-84. [doi:10.3969/j.issn.1001-4616.2026.02.008]

Integration of Bilingual Information for Nuclearity Recognition in Chinese Discourse

Journal of Nanjing Normal University (Natural Science Edition) [ISSN:1001-4616/CN:32-1239/N]

Volume:
49
Issue:
2026, No. 02
Pages:
74-84
Section:
Computer Science and Technology
Publication date:
2026-04-10

Article Info

Title:
Integration of Bilingual Information for Nuclearity Recognition in Chinese Discourse
Article number:
1001-4616(2026)02-0074-11
Author(s):
Li Yancui, Guo Pengcheng, Miao Guoyi
1. School of Computer and Information Engineering, Henan Normal University, Xinxiang 453007, China
2. Henan Key Laboratory of Educational Artificial Intelligence and Personalized Learning, Xinxiang 453007, China
3. Henan Engineering Research Center of Teaching Resources & Education Quality Evaluation Big Data, Xinxiang 453007, China
Keywords:
discourse analysis nuclearity recognition pretrained models bilingual information
CLC number:
TP391
DOI:
10.3969/j.issn.1001-4616.2026.02.008
Document code:
A
Abstract:
Chinese nuclearity recognition faces inherent difficulties owing to the scarcity of explicit inter-sentential connectives, whereas English systematically marks nuclearity through subordinate constructions and discourse markers. Existing approaches train models exclusively on Chinese corpora without leveraging English signals. Our method addresses this gap by training on parallel bilingual data: a multilingual pre-trained model encodes the bilingual texts, and a multi-head attention mechanism over the resulting encodings captures explicit and implicit nuclearity cues. Experiments on the Chinese Discourse Treebank (CDTB) show that our model improves on the previous state-of-the-art GMN-Nu model by 8.7% in macro-F1 and 6.1% in micro-F1. Compared with monolingual training, the bilingual fusion strategy increases micro-F1 by 1.6%, 3.5%, and 1.3% for mBERT, mT5, and XLM-R, respectively. Experiments on the Chinese-English Discourse Treebank (CEDT) further demonstrate gains of 10.2% in micro-F1 and 5.8% in macro-F1 over monolingual methods.
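The abstract describes the architecture only at a high level: a multilingual encoder produces representations of the concatenated Chinese-English input, multi-head attention is applied over those encodings, and a classifier predicts the nuclearity label. A minimal NumPy sketch of the multi-head attention and classification steps follows; random weights stand in for the multilingual encoder's learned parameters, and the dimensions, token count, and label set ("NS", "SN", "NN") are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    """Scaled dot-product multi-head self-attention over one encoded sequence."""
    seq_len, d_model = X.shape
    d_k = d_model // num_heads
    # Random projections stand in for the learned Wq/Wk/Wv/Wo matrices.
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Split each projection into heads: (num_heads, seq_len, d_k).
    split = lambda M: M.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    scores = split(Q) @ split(K).transpose(0, 2, 1) / np.sqrt(d_k)  # (h, L, L)
    heads = softmax(scores) @ split(V)                              # (h, L, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, num_heads, labels = 16, 4, ["NS", "SN", "NN"]  # illustrative sizes/labels
# Stand-in for the multilingual encoder's output on the concatenated
# [Chinese clause pair ; English translation] input (8 "tokens" here).
encoded = rng.standard_normal((8, d_model))
attended = multi_head_attention(encoded, num_heads, rng)
pooled = attended.mean(axis=0)                # mean-pool over token positions
W_cls = rng.standard_normal((d_model, len(labels)))
probs = softmax(pooled @ W_cls)               # distribution over nuclearity labels
print(labels[int(probs.argmax())], probs.round(3))
```

With trained rather than random weights, the attention layer is what lets token positions in the Chinese clauses attend to explicit subordinators and discourse markers in the parallel English text, which is the bilingual signal the method exploits.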

References:

[1]Mann W C,Thompson S A. Rhetorical Structure Theory:toward a functional theory of text organization[J]. Text-Interdisciplinary Journal for the Study of Discourse,1988,8(3):243-281.
[2]Li Y C,Feng W H,Sun J,et al. Building Chinese discourse corpus with connective-driven dependency tree structure[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Doha:Association for Computational Linguistics,2014:2105-2114.
[3]Zhou Y P,Xue N W. PDTB-style discourse annotation of Chinese text[C]//Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. Jeju Island:Association for Computational Linguistics,2012:69-77.
[4]Chu X M,Zhu Q M,Zhou G D. Research on discourse nuclearity relations in natural language processing[J]. Chinese Journal of Computers,2017,40(4):842-860.
[5]Carlson L,Marcu D,Okurowski M E. Building a discourse-tagged corpus in the framework of rhetorical structure theory[M]//Current and New Directions in Discourse and Dialogue. Dordrecht,Netherlands:Kluwer Academic Publishers,2003:85-112.
[6]Jiang F,Xu S,Chu X M,et al. MCDTB:A macro-level Chinese discourse TreeBank[C]//Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe:Association for Computational Linguistics,2018:3493-3504.
[7]Joty S R,Carenini G,Ng R T,et al. Combining intra- and multi-sentential rhetorical parsing for document-level discourse analysis[C]//Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Sofia:Association for Computational Linguistics,2013:486-496.
[8]Li S J,Wang L,Cao Z Q,et al. Text-level discourse dependency parsing[C]//Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore:Association for Computational Linguistics,2014:25-35.
[9]Ji Y,Eisenstein J. Representation learning for text-level discourse parsing[C]//Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore:Association for Computational Linguistics,2014:13-24.
[10]Kong F,Wang H L,Zhou G D. A CDT-Styled end-to-end Chinese discourse parser[J]. ACM Transactions on Asian and Low-Resource Language Information Processing,2017,16(4):387-398.
[11]Zhang L Y,Xing Y Q,Kong F,et al. A Top-Down neural architecture towards text-level parsing of discourse rhetorical structure[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online:Association for Computational Linguistics,2020:6386-6395.
[12]Xu S,Li P F,Zhou G D,et al. Employing text matching network to recognise nuclearity in Chinese discourse[C]//Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe:Association for Computational Linguistics,2018:525-535.
[13]Wang T S,Li P F,Zhu Q M. Chinese discourse nuclearity recognition method based on gated memory network[J]. Journal of Chinese Information Processing,2019,33(5):39-46.
[14]Sun Z H,Zhou Y,Zhu Q M,et al. Chinese macro discourse nuclearity recognition method based on discourse topic[J]. Journal of Chinese Information Processing,2020,34(12):30-38.
[15]Mikolov T,Sutskever I,Chen K,et al. Distributed representations of words and phrases and their compositionality[C]//Conference on Neural Information Processing Systems. Lake Tahoe:Neural Information Processing Systems Foundation,2013:3111-3119.
[16]Devlin J,Chang M W,Lee K,et al. BERT:Pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of NAACL-HLT 2019. Minneapolis:Association for Computational Linguistics,2019:4171-4186.
[17]Brown T,Mann B,Ryder N,et al. Language models are few-shot learners[C]//Conference on Neural Information Processing Systems. Vancouver:Neural Information Processing Systems Foundation,2020:1877-1901.
[18]Yang Z Y,Gong Z X,Kong F,et al. Chinese zero anaphora resolution based on Chinese-English comparable corpora[J]. Acta Scientiarum Naturalium Universitatis Pekinensis,2017,53(2):279-286.
[19]Wang L Y,Tu Z P,Zhang X J,et al. A novel approach to dropped pronoun translation[C]//North American Chapter of the Association for Computational Linguistics. San Diego:Association for Computational Linguistics,2016:983-993.
[20]Wang L Y,Tu Z P,Zhang X J,et al. A novel and robust approach for pro-drop language translation[J]. Machine Translation,2017,31(1):65-87.
[21]Wang L Y,Tu Z P,Shi S M,et al. Translating pro-drop languages with reconstruction models[C]//Proceedings of the AAAI Conference on Artificial Intelligence. New Orleans:AAAI Press,2018:4937-4945.
[22]Feng W H,Li Y C,Ren H,et al. Evaluation of alignment annotation for the Chinese-English parallel corpus of discourse structure[J]. Journal of Chinese Information Processing,2017,31(3):86-93.
[23]Zhang L Y. Research on discourse rhetorical structure parsing with multi-level knowledge fusion[D]. Suzhou:Soochow University,2022.
[24]Pires T,Schlinger E,Garrette D. How multilingual is multilingual BERT?[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence:Association for Computational Linguistics,2019:4996-5001.
[25]Conneau A,Khandelwal K,Goyal N,et al. Unsupervised cross-lingual representation learning at scale[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online:Association for Computational Linguistics,2020:8440-8451.
[26]Vaswani A,Shazeer N,Parmar N,et al. Attention is all you need[C]//31st Conference on Neural Information Processing Systems. Long Beach:Neural Information Processing Systems Foundation,2017:5998-6008.
[27]Feng W H. Alignment annotation for the Chinese-English parallel corpus of discourse structure[J]. Journal of Chinese Information Processing,2013,27(6):158-164.
[28]Xue L,Constant N,Roberts A,et al. mT5:A massively multilingual pre-trained text-to-text transformer[C]//North American Chapter of the Association for Computational Linguistics. Online:Association for Computational Linguistics,2021:483-498.

Memo:
Received: 2025-03-24.
Foundation items: Humanities and Social Sciences Research Project of the Ministry of Education (22YJCZH091), Henan Province Science and Technology Research Project (252102210102, 262102210084), and Natural Science Foundation of Henan Province (262300421797).
Corresponding author: Miao Guoyi, Ph.D., research interest: natural language processing. E-mail: miaoguoyi@htu.edu.cn
Last Update: 2026-04-10