DPRM: A Dual Implicit Process Reward Model in Multi-Hop Question Answering
Positive | Artificial Intelligence
The Dual Implicit Process Reward Model (DPRM) targets multi-hop question answering (MHQA), where answering a question requires chaining several reasoning steps. By combining Chain of Thought (CoT) reasoning with Knowledge Graph (KG) reasoning and aligning the two through semantic matching, the DPRM improves answer quality and reduces hallucinations. Traditional Outcome Reward Models (ORMs) give feedback only after the final answer is produced, while Process Reward Models (PRMs) evaluate intermediate reasoning steps but typically require expensive step-level human annotations. The DPRM addresses both limitations by training two implicit PRMs, one for CoT reasoning and one for KG reasoning, so reasoning steps can be scored without explicit step annotations. The dual design also rewards consistency between the CoT and KG reasoning paths, making the approach better suited to the demands of MHQA than either an ORM or a single PRM alone.
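The article describes the mechanism only at a high level. The following Python sketch illustrates one plausible way a dual process reward with a consistency term could be combined; it is not the authors' implementation. The function names (score_cot_step, score_kg_step, embed), the averaging of step rewards, and the weighting scheme are all assumptions made for illustration.

```python
# Hypothetical sketch of a dual process reward with a CoT-KG consistency bonus.
# score_cot_step / score_kg_step stand in for the two implicit PRMs; embed
# stands in for a sentence encoder used for semantic matching. All are
# placeholders, not part of the DPRM paper.

from dataclasses import dataclass
from typing import Callable, List
import math


@dataclass
class DualRewardConfig:
    alpha: float = 0.5   # weight between the CoT and KG process rewards
    beta: float = 0.3    # weight of the CoT-KG consistency bonus


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity, used here as a stand-in for semantic matching."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def dual_process_reward(
    cot_steps: List[str],
    kg_steps: List[str],
    score_cot_step: Callable[[str], float],   # implicit PRM over CoT steps
    score_kg_step: Callable[[str], float],    # implicit PRM over KG steps
    embed: Callable[[str], List[float]],      # sentence embedding for matching
    cfg: DualRewardConfig = DualRewardConfig(),
) -> float:
    """Combine per-step rewards from two implicit PRMs with a bonus for
    semantic agreement between paired CoT and KG reasoning steps."""
    r_cot = sum(score_cot_step(s) for s in cot_steps) / max(len(cot_steps), 1)
    r_kg = sum(score_kg_step(s) for s in kg_steps) / max(len(kg_steps), 1)

    # Consistency: average similarity over aligned step pairs
    # (zip truncates to the shorter of the two reasoning traces).
    sims = [cosine(embed(c), embed(k)) for c, k in zip(cot_steps, kg_steps)]
    consistency = sum(sims) / len(sims) if sims else 0.0

    return cfg.alpha * r_cot + (1 - cfg.alpha) * r_kg + cfg.beta * consistency
```

Under these assumptions, candidate answers could then be ranked by the combined reward, keeping the candidate whose CoT and KG traces both score well and agree with each other.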
— via World Pulse Now AI Editorial System