In an article recently submitted to the arXiv* preprint server, researchers proposed an approach based on explicit hypothesis formation to improve the inductive reasoning ability of large language models (LLMs) on complex tasks.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
Inductive reasoning refers to the ability to identify general, underlying principles from a few examples and then apply those principles to novel scenarios. Recently, pre-trained LLMs have gained significant attention for application to different reasoning tasks, including symbolic, arithmetic, and commonsense reasoning, and several studies have evaluated LLMs on inductive reasoning tasks.
Although LLMs can handle straightforward inductive tasks effectively with a direct prompting approach, they perform poorly on more complex tasks such as the Abstraction and Reasoning Corpus (ARC). ARC is a difficult visual inductive reasoning benchmark: every task provides a set of training input-output pairs that share a transformation rule, and the model must predict the corresponding outputs for novel test inputs.
Moreover, the answers in ARC require precise and often complex transformations. Studies have shown that LLMs predicting ARC outputs directly through in-context learning achieve poor inductive reasoning performance compared to humans.
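To make the task format concrete, the toy example below (illustrative only, not taken from the paper) represents an ARC-style task in Python: grids are small 2D arrays of integers denoting colors, each task supplies a few training input-output pairs that share one rule, and the goal is to produce the output for a held-out test input. The rule used here, recoloring every non-zero cell, is deliberately far simpler than typical ARC transformations.

```python
# Toy ARC-style task (illustrative only). Grids are 2D lists of integers 0-9,
# where each integer denotes a color. The shared rule in this toy example:
# recolor every non-zero cell to color 2.
toy_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[0, 2], [2, 0]]},
        {"input": [[3, 0], [0, 3]], "output": [[2, 0], [0, 2]]},
    ],
    "test": [
        {"input": [[0, 5], [5, 5]]},  # expected output: [[0, 2], [2, 2]]
    ],
}
```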
Improving the LLM inductive reasoning ability
In this paper, the researchers proposed to improve LLM inductive reasoning by decomposing the task into explicit hypothesis formation at two levels of abstraction. The LLM was first prompted to propose multiple abstract hypotheses about a problem in natural language; these hypotheses were then implemented as concrete Python programs, which were executed to make predictions.
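A minimal sketch of this two-level decomposition is given below. The `sample_llm` helper and the prompt wording are assumptions made for illustration; they are not the authors' implementation.

```python
from typing import Callable, List

def hypothesis_search(task_text: str,
                      sample_llm: Callable[[str, int], List[str]],
                      n_hypotheses: int = 8,
                      n_programs: int = 4) -> List[str]:
    """Two levels of abstraction: natural language hypotheses first,
    then candidate Python implementations of each hypothesis."""
    # Level 1: abstract hypotheses about the underlying rule, in natural language.
    hypothesis_prompt = (
        "Here are input-output examples that share a single transformation rule:\n"
        f"{task_text}\n"
        "Describe the rule in one or two sentences."
    )
    hypotheses = sample_llm(hypothesis_prompt, n_hypotheses)

    # Level 2: concrete programs implementing each natural language hypothesis.
    programs: List[str] = []
    for hypothesis in hypotheses:
        program_prompt = (
            f"Rule: {hypothesis}\n"
            "Write a Python function transform(grid) that implements this rule."
        )
        programs.extend(sample_llm(program_prompt, n_programs))
    return programs
```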
Although natural language can provide abstract representations that capture key features, it is potentially ambiguous and difficult to verify. Programmatic hypotheses, by contrast, can be verified directly on the examples through execution and generalize straightforwardly to new inputs. However, they involve many implementation details that can distract a language model.
The researchers therefore used concrete programmatic implementations as a generalizable, precise representation of a given natural language inductive hypothesis. The proposed method thus disentangles inductive reasoning into two abilities: proposing precise natural language hypotheses about the underlying rules, and formalizing those rules as programs.
However, LLMs cannot be expected to generate a good hypothesis in a single attempt in realistic settings. Sampling several hypotheses, and several programs per hypothesis, can address this problem, but such an approach is extremely expensive. A middle step is therefore needed to filter the generated hypotheses down to the most promising ones.
The researchers investigated several approaches to effectively reduce the number of hypotheses that are implemented as programs. The first approach asks the LLM to summarize the many generated hypotheses into a smaller set; the second queries a human oracle (annotator) to evaluate all generated hypotheses and select the correct ones.
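A sketch of the first strategy, LLM-based summarization of the candidate pool, is shown below; the `query_llm` helper and the prompt text are illustrative assumptions rather than the paper's prompts.

```python
from typing import Callable, List

def summarize_hypotheses(hypotheses: List[str],
                         query_llm: Callable[[str], str],
                         target_count: int = 8) -> List[str]:
    """Ask the LLM to condense many candidate hypotheses into a few
    representative ones before any of them are implemented as programs."""
    numbered = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(hypotheses))
    prompt = (
        "Below are candidate descriptions of a transformation rule:\n"
        f"{numbered}\n"
        f"Merge duplicates and summarize them into at most {target_count} "
        "distinct hypotheses, one per line."
    )
    summary = query_llm(prompt)
    return [line.strip() for line in summary.splitlines() if line.strip()]
```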
Evaluation of the proposed method
The researchers prompted Generative Pre-trained Transformer 4 (GPT-4) to generate natural language hypotheses for inductive reasoning problems, sampling multiple responses from the LLM at a temperature of 1.0 as hypothesis candidates. The most promising hypotheses were then identified from these candidates using either the LLM itself or human annotators.
This produced a set of candidate hypotheses for each problem. Each hypothesis was then given individually to GPT-4 as input, and the model was prompted to generate several Python programs implementing the described transformation.
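The sampling step could be implemented roughly as follows using the OpenAI Python SDK's chat completions endpoint; the model identifier, prompt contents, and sample count shown here are assumptions for illustration, not details reported in the paper.

```python
from openai import OpenAI  # assumes the openai Python SDK (v1+) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def sample_candidates(prompt: str, n: int = 8, temperature: float = 1.0) -> list[str]:
    """Draw n independent completions at temperature 1.0, used both for
    candidate hypotheses and for candidate program implementations."""
    response = client.chat.completions.create(
        model="gpt-4",                                   # assumed model name
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        n=n,
    )
    return [choice.message.content for choice in response.choices]
```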
The researchers then ran the generated programs against the problem's original input-output examples to determine whether each program produced the correct output in every case. If a code implementation correctly reproduced the outputs of all training examples, it was selected to generate a prediction for the test input.
If none of the generated implementations passed all training examples, GPT-4 was prompted to revise the implementations based on their execution results on the training set. The performance of each method, including the proposed approach and the direct prompting baseline, was measured as the accuracy of GPT-4's predictions on the test inputs.
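The verify-and-revise loop described above could be sketched as follows, assuming each candidate program defines a transform(grid) function and that `revise_llm` wraps a GPT-4 call; the feedback prompt is an illustrative assumption.

```python
from typing import Callable, List, Optional

def passes_training(program_src: str, train_pairs: List[dict]) -> bool:
    """Execute a candidate program and check that it reproduces every training output."""
    namespace: dict = {}
    try:
        exec(program_src, namespace)  # candidate is expected to define transform(grid)
        transform = namespace["transform"]
        return all(transform(pair["input"]) == pair["output"] for pair in train_pairs)
    except Exception:
        return False  # crashes or a missing transform() count as failures

def solve(task: dict, programs: List[str],
          revise_llm: Callable[[str], str],
          max_rounds: int = 2) -> Optional[list]:
    """Select a program that passes all training pairs; otherwise ask the LLM to
    revise the failing programs using the execution results, then try again."""
    for _ in range(max_rounds + 1):
        for src in programs:
            if passes_training(src, task["train"]):
                namespace: dict = {}
                exec(src, namespace)
                return namespace["transform"](task["test"][0]["input"])
        # No program passed: request revisions conditioned on the failures.
        programs = [
            revise_llm(
                "The following program failed on the training examples:\n"
                f"{src}\n"
                "Revise it so that it maps every training input to its output."
            )
            for src in programs
        ]
    return None  # no verified program found within the feedback budget
```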
Researchers performed experiments on three inductive reasoning datasets, including ARC, the Syntax-Guided Synthesis (SyGuS) dataset, and the one-dimensional variant of ARC (1D-ARC), to verify the effectiveness of their proposed method.
Significance of the study
The explicit hypothesis formation approach substantially improved GPT-4's performance over direct prompting. Using the proposed method with LLM-summarized candidate hypotheses, GPT-4 achieved 27.5% accuracy on a random 40-problem subset of ARC.
Moreover, with human annotator-selected hypotheses, the proposed method reached an even higher 37.5% accuracy on the same 40-problem subset. Both figures are significantly higher than the 12.5% accuracy achieved with the baseline direct prompting approach.
On the 1D-ARC dataset, GPT-4 using the proposed method achieved 77.8% accuracy, compared to 38.8% with direct prompting. In the 1D-ARC experiments, no reduction of the initially generated hypotheses was required, since generating only 16 hypothesis candidates was already enough to obtain reasonably correct hypotheses. On the SyGuS dataset, GPT-4 generated correct programs for 94.3% of tasks using eight programs and two rounds of execution feedback without any hypothesis generation, showing that direct program generation already performs strongly on this benchmark.
To summarize, the findings of this study demonstrate that the proposed approach based on explicit hypothesis formation outperforms the baseline method on all three inductive reasoning datasets. Both levels of abstraction, natural language hypothesis generation and programmatic hypothesis representation, were beneficial for performing inductive reasoning tasks.
However, the major limitations of the proposed method are the LLM's occasional inability to generate a sufficiently precise natural language hypothesis and the possibility of producing incorrect programs even when given a correct hypothesis.
Journal reference:
- Preliminary scientific report.
Wang, R., Zelikman, E., Poesia, G., Pu, Y., Haber, N., & Goodman, N. D. (2023). Hypothesis Search: Inductive Reasoning with Language Models. arXiv. https://doi.org/10.48550/arXiv.2309.05660, https://arxiv.org/abs/2309.05660