Boosting Inductive Reasoning in Large Language Models: The Power of Hypothesis Formation

In an article recently submitted to the arXiv* preprint server, researchers proposed an approach based on explicit hypothesis formation to improve the inductive reasoning ability of large language models (LLMs) on complex tasks.

Study: Boosting Inductive Reasoning in Large Language Models: The Power of Hypothesis Formation. Image credit: Blue Planet Studio/Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.

Background

Inductive reasoning refers to the ability to identify general underlying principles from a few examples and then apply these principles to novel scenarios. Recently, pre-trained LLMs have gained significant attention for different reasoning tasks, including symbolic, arithmetic, and commonsense reasoning, and several studies have evaluated LLMs on inductive reasoning tasks.

Although LLMs can be used effectively for straightforward inductive tasks with a direct prompting approach, these models perform poorly on more complex tasks, such as the Abstraction and Reasoning Corpus (ARC). ARC is a challenging visual inductive reasoning benchmark: for every task, a model is given a set of training input-output pairs that share a transformation rule, and its goal is to predict the corresponding outputs for novel test inputs.
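
To make the task format concrete, the sketch below shows how a single ARC-style task might be represented in Python. The grids and the hidden rule (horizontal mirroring) are illustrative assumptions, not an actual task from the benchmark.

```python
# A minimal, illustrative representation of one ARC-style task.
# Grids are 2D lists of integers, where each integer encodes a color.
# The hidden rule here (mirror each row left-to-right) is an assumption
# chosen for illustration, not a task taken from the benchmark.
arc_task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0], [0, 4, 0]], "output": [[0, 3, 3], [0, 4, 0]]},
    ],
    "test": [
        {"input": [[5, 0, 0], [0, 6, 0]]},  # the model must predict the output grid
    ],
}
```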

Moreover, the answers in ARC require precise and often complex transformations. Studies have shown that LLMs predicting ARC outputs through in-context learning achieve poor inductive reasoning performance compared to humans.

Improving the LLM inductive reasoning ability

In this paper, the researchers proposed improving the inductive reasoning ability of LLMs by generating explicit hypotheses, decomposing the task into hypothesis formation at two levels of abstraction. The LLM was first prompted to propose multiple abstract hypotheses about a problem in natural language; these natural language hypotheses were then implemented as concrete Python programs, which were used to make predictions.

Although natural language can provide abstract representations that reveal key features, it is potentially ambiguous and difficult to verify. Programmatic hypotheses can be verified directly on examples through execution and can be applied straightforwardly to new inputs. However, they involve many implementation details that can distract a language model.

The researchers used specific programmatic implementations as a generalizable, precise representation of a given inductive hypothesis formulated in natural language. The proposed method thus disentangles inductive reasoning into two abilities: proposing precise natural language hypotheses about the underlying rules, and formalizing those rules as programs.
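
To illustrate the two levels of representation, the sketch below pairs a natural language hypothesis with a corresponding Python implementation. The hypothesis wording and the `transform` function are invented for illustration and are not taken from the paper's prompts or outputs.

```python
# Illustrative pairing of the two representation levels (not from the paper).
# Abstract level: a natural-language hypothesis about the hidden rule.
hypothesis = "The output grid is the input grid mirrored left-to-right."

# Concrete level: the same hypothesis formalized as an executable program.
def transform(grid: list[list[int]]) -> list[list[int]]:
    """Mirror each row of the grid horizontally."""
    return [list(reversed(row)) for row in grid]

# Because the program is executable, it can be checked against the training
# pairs and, if it passes, applied directly to the novel test input.
assert transform([[1, 0], [2, 0]]) == [[0, 1], [0, 2]]
```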

However, LLMs cannot be expected to generate a good hypothesis in a single attempt in realistic settings. Although sampling several hypotheses, and several programs per hypothesis, can address this problem, such an approach can be extremely expensive. Thus, an intermediate step is needed to filter the generated hypotheses down to the most promising ones.

The researchers investigated several approaches to effectively reduce the number of hypotheses that are implemented as programs. The first approach asked the LLM to summarize several hypotheses into a smaller set, while the second queried a human oracle/annotator to evaluate all generated hypotheses and select the correct ones.
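
A rough sketch of the first filtering approach (LLM-based summarization) is given below. The prompt wording, the target count, and the `query_llm` helper are hypothetical stand-ins for whatever model call and settings are actually used.

```python
# Hypothetical sketch of LLM-based hypothesis summarization (filtering step).
# `query_llm` is a placeholder for a chat-completion call returning a string.
def summarize_hypotheses(query_llm, hypotheses: list[str], target_count: int = 8) -> list[str]:
    numbered = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(hypotheses))
    prompt = (
        "Below are candidate hypotheses describing the same transformation rule.\n"
        f"{numbered}\n\n"
        f"Merge near-duplicates and return the {target_count} most plausible "
        "hypotheses, one per line."
    )
    response = query_llm(prompt)
    # Keep non-empty lines as the reduced hypothesis set.
    return [line.strip() for line in response.splitlines() if line.strip()][:target_count]
```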

Evaluation of the proposed method

The researchers prompted Generative Pre-trained Transformer 4 (GPT-4) to generate natural language hypotheses for inductive reasoning problems and sampled multiple responses from the LLM at a temperature of 1.0 as hypothesis candidates. The most promising hypotheses were then identified from these candidates using LLMs or human annotators.
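
A minimal sketch of this sampling step is shown below. The prompt text, the sample count, and the `query_llm` wrapper are assumptions; only the 1.0 temperature follows the description above.

```python
# Hypothetical sketch of sampling natural-language hypothesis candidates.
# `query_llm` is a placeholder wrapper around a GPT-4 chat-completion call
# that accepts a prompt, a sampling temperature, and a number of samples,
# and returns one string per sample.
def sample_hypotheses(query_llm, task_description: str, num_samples: int = 16) -> list[str]:
    prompt = (
        "Here are input-output examples that share one transformation rule:\n"
        f"{task_description}\n"
        "Describe the rule as one concise natural-language hypothesis."
    )
    # Temperature 1.0 matches the study; the sample count is an assumption.
    return query_llm(prompt, temperature=1.0, n=num_samples)
```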

A set of candidate hypotheses was thus obtained for each problem, and each hypothesis was then provided individually as input to GPT-4, which was prompted to generate several Python programs implementing the described transformation.

Finally, the researchers ran these generated programs against the problem's original input-output examples to determine whether they produced the correct output for each case. If a code implementation correctly generated the outputs for all training examples, that implementation was selected to produce a prediction for the test input.
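
The execution-based check described above can be sketched roughly as follows. Representing each candidate program as source code that defines a `transform(grid)` function is an assumption made for illustration.

```python
# Rough sketch of execution-based program verification (illustrative only).
def passes_all_training_examples(program_source: str, train_pairs: list[dict]) -> bool:
    """Execute a candidate program and check it against every training pair."""
    namespace: dict = {}
    try:
        exec(program_source, namespace)  # assumes the program defines transform(grid)
        transform = namespace["transform"]
        return all(transform(pair["input"]) == pair["output"] for pair in train_pairs)
    except Exception:
        return False  # crashes or a missing transform function count as failures

def select_program(candidate_sources: list[str], train_pairs: list[dict]):
    """Return the first candidate that reproduces all training outputs, if any."""
    for source in candidate_sources:
        if passes_all_training_examples(source, train_pairs):
            return source
    return None
```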

If none of the generated code implementations passed all training examples, GPT-4 was prompted to revise the implementations based on the execution results on the training set. The accuracy of GPT-4's predictions on the test inputs was used to measure the performance of the various methods, including the proposed method and the direct prompting approach.
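
The feedback-driven revision step could look roughly like the sketch below, reusing the `passes_all_training_examples` check from the previous sketch. The number of rounds, the prompt wording, and the `query_llm` helper are illustrative assumptions rather than the authors' actual setup.

```python
# Hypothetical sketch of revising a failed program with execution feedback.
def revise_with_feedback(query_llm, program_source: str, train_pairs: list[dict],
                         max_rounds: int = 2):
    for _ in range(max_rounds):
        if passes_all_training_examples(program_source, train_pairs):
            return program_source
        # Report the expected behavior so the model can see where the program fails.
        feedback = "\n".join(
            f"input={pair['input']} expected={pair['output']}" for pair in train_pairs
        )
        prompt = (
            "The following Python program does not reproduce all training outputs.\n"
            f"{program_source}\n\nTraining examples:\n{feedback}\n\n"
            "Return a corrected program that defines transform(grid)."
        )
        program_source = query_llm(prompt)
    return program_source if passes_all_training_examples(program_source, train_pairs) else None
```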

The researchers performed experiments on three inductive reasoning datasets, namely ARC, the Syntax-Guided Synthesis (SyGuS) dataset, and the one-dimensional variant of ARC (1D-ARC), to verify the effectiveness of their proposed method.

Significance of the study

The explicit hypothesis formation approach substantially improved GPT-4's performance over the direct prompting approach. Using the proposed method with LLM-summarized candidate hypotheses, GPT-4 achieved 27.5% accuracy on a random 40-problem subset of ARC.

Moreover, using the proposed method with human-annotator-selected hypotheses, GPT-4 showed an even higher 37.5% accuracy on the same 40-problem subset of ARC. Both figures were significantly higher than the 12.5% accuracy achieved with the baseline direct prompting approach.

On the 1D-ARC dataset, GPT-4 using the proposed method reached 77.8% accuracy, compared to 38.8% with direct prompting. In the 1D-ARC experiments, reducing the number of initially generated hypotheses was not required, as reasonably correct hypotheses were obtained from only 16 hypothesis candidates. On the SyGuS dataset, GPT-4 generated correct programs for 94.3% of tasks using eight programs and two rounds of feedback without explicit hypothesis generation, indicating that a direct program generation approach already performs strongly on this benchmark.

To summarize, the findings of this study demonstrated that the proposed approach based on explicit hypothesis formation can outperform the baseline method on all three inductive reasoning datasets. Both levels of abstraction, natural language hypothesis generation and programmatic hypothesis representation, were beneficial for performing inductive reasoning tasks.

However, the proposed method has two major limitations: the LLM may fail to generate a sufficiently precise natural language hypothesis, and it may generate incorrect programs even when given a correct hypothesis.


