6.1. LLMs as Tax Attorneys

In this section we go over the following research paper:

Large Language Models as Tax Attorneys: A Case Study in Legal Capabilities Emergence

By: John J. Nay, David Karamardian, Sarah B. Lawsky, Wenting Tao, Meghana Bhat, Raghav Jain, Aaron Travis Lee, Jonathan H. Choi, Jungo Kasai

Available at: https://arxiv.org/abs/2306.07075

Introduction

The authors of this paper conducted a study to determine whether expert tax attorneys could potentially be replaced with currently available AI models. Tax law was chosen as the subject of this study due to its intricate structure and the logical reasoning and mathematical skills its application requires. Additionally, legal authority in tax law is principally concentrated in two sources, the Treasury Regulations under the CFR and Title 26 of the U.S. Code (also called the Internal Revenue Code), making it logistically easier to include the correct legal texts for reference.

Another motivating factor for choosing tax law is the fact that it's deeply intertwined with the real-world economic lives of citizens and companies, making the implications of this study highly relevant and far-reaching.

Hypothesis

The authors hypothesize that LLMs, particularly when combined with prompting enhancements and the correct legal texts, can perform at high levels of accuracy but not yet at expert tax lawyer levels. They also suggest that as LLMs continue to advance, their ability to reason about law autonomously could have significant implications for the legal profession and AI governance.

You can get a sense for how this study was structured in the diagram below.

Evaluation (Prompt Engineering)

The authors employed a variety of prompt engineering techniques to enhance the performance of Large Language Models (LLMs) in the context of tax law. These techniques included:

  1. Chain-of-Thought (CoT) Prompting: This technique involves asking the LLM to think through its response step-by-step. The idea is to encourage the model to generate more reasoned and thoughtful responses. However, the results showed that CoT prompting did not consistently improve results for all models and retrieval methods. It did, however, boost the performance of GPT-4, suggesting that an LLM might need to have a certain capability level to exhibit improved performance through additional reasoning.

  2. Few-Shot Prompting: In this approach, the LLM is given three example question-answer pairs along with the question being asked. This gives the model context and a pattern to follow when generating its own response. The authors found that few-shot prompting significantly improved results for GPT-4 but was less consistently useful for weaker models.

  3. Self-Reflection and Self-Refinement Techniques: These advanced techniques involve prompting the LLM with its own answer and the relevant context, then asking it to identify any ambiguities in the question or to doubt its current answer. The response can then be used to conduct further retrieval augmented generation. While the paper does not report specific results for these techniques, it identifies them as prime candidates for increasing performance. (A minimal sketch of such a second pass appears at the end of this section.)

  4. Document Retrieval: The authors experimented with different retrieval methods, each with its own prompt template that provides different supporting context to the LLM. They found that providing the LLM with more legal text, and more relevant legal text, weakly increased accuracy for most models. The sketch after this list illustrates how retrieved passages, few-shot examples, and a CoT instruction might be combined into a single prompt.
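To make these techniques concrete, here is a minimal Python sketch of how a retrieval-augmented, few-shot, chain-of-thought prompt might be assembled and sent to a model. This is not the authors' code: the legal passages, example question-answer pairs, prompt wording, and the word-overlap retriever are illustrative stand-ins (the paper uses similarity search over embedded legal documents), and it assumes the official openai Python client (v1-style API).

```python
from openai import OpenAI  # assumes the official openai Python client (v1-style API)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Toy stand-in corpus; the paper retrieves from the Internal Revenue Code
# and the Treasury Regulations.
LEGAL_PASSAGES = [
    "IRC sec. 61(a): gross income means all income from whatever source derived.",
    "IRC sec. 102(a): gross income does not include the value of property acquired by gift, bequest, devise, or inheritance.",
    "Treas. Reg. sec. 1.61-2: compensation for services is includible in gross income.",
]

# Hypothetical few-shot examples (three question-answer pairs, as in the paper's setup).
FEW_SHOT_EXAMPLES = [
    ("Q: Is a cash salary includible in gross income?", "A: Yes."),
    ("Q: Is a birthday gift from a parent includible in gross income?", "A: No."),
    ("Q: Is bartered compensation includible in gross income?", "A: Yes."),
]


def retrieve_passages(question: str, k: int = 2) -> list[str]:
    """Toy word-overlap retriever, standing in for embedding-based similarity search."""
    q_words = set(question.lower().split())
    ranked = sorted(
        LEGAL_PASSAGES,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str) -> str:
    """Combine retrieved legal text, few-shot examples, and a CoT instruction."""
    context = "\n".join(retrieve_passages(question))
    examples = "\n".join(f"{q}\n{a}" for q, a in FEW_SHOT_EXAMPLES)
    return (
        "You are answering questions about U.S. federal tax law.\n\n"
        f"Relevant legal text:\n{context}\n\n"
        f"Examples:\n{examples}\n\n"
        f"Q: {question}\n"
        "Think through the relevant law step by step, then state a final answer."
    )


question = "Is employer-provided lodging includible in the employee's gross income?"
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": build_prompt(question)}],
    temperature=0,
)
print(response.choices[0].message.content)
```

In the "gold truth" setting discussed later, the retrieval step would simply be replaced by the passages known to be relevant to the question.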

Overall, the results indicated that the effectiveness of these prompt engineering techniques varied depending on the specific LLM and the retrieval setup, with the largest gains accruing to the most capable model. Taken together, though, they illustrate how prompt design can enhance an LLM's ability to reason about tax law and generate accurate responses.
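The self-reflection and self-refinement idea from item 3 above can be sketched as a second pass over the model's own draft answer. Again, this is only a sketch under assumed prompt wording, reusing the `client` from the previous snippet; the paper does not publish a specific template for this step.

```python
def refine_answer(client, question: str, context: str, draft_answer: str) -> str:
    """Second pass: show the model its own draft and ask it to critique and revise it.

    A minimal self-reflection sketch; the critique could also be fed back into
    another round of retrieval before the final answer is produced.
    """
    critique_prompt = (
        f"Question:\n{question}\n\n"
        f"Relevant legal text:\n{context}\n\n"
        f"Draft answer:\n{draft_answer}\n\n"
        "Identify any ambiguities in the question and any weaknesses in the draft "
        "answer, then give a revised final answer."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": critique_prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```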

Evaluation (Models)

The paper evaluated the performance of four increasingly advanced LLMs released by OpenAI over the three years preceding the study. The findings for each model are as follows:

  1. GPT-4: This was the most advanced model evaluated in the study. The authors found that GPT-4 benefited significantly from Chain-of-Thought (CoT) prompting, which asks the LLM to think through its response step-by-step, suggesting that an LLM might need to reach a certain capability level before additional reasoning improves its performance. GPT-4 also showed significant improvement with few-shot prompting, where three example question-answer pairs are provided along with the question being asked. Furthermore, GPT-4 showed a clear performance boost when fed the "gold truth" legal documents directly, rather than relying on similarity search to extract the relevant documents from a vector database.

  2. GPT-3.5: The results for GPT-3.5 were less consistent than for GPT-4. Few-shot prompting was less useful for this model, and the benefits of CoT prompting were not as pronounced.

  3. GPT-3 (davinci): This is the "most capable" GPT-3 model according to OpenAI. GPT-3 was consistently outperformed by the newer models, reflecting the advancement of LLM technology over time. The benefits of advanced prompting techniques were also less pronounced for this model.

  4. GPT-3 (text-davinci-002): Despite the GPT-3 label, text-davinci-002 is an earlier model in the GPT-3.5 series that OpenAI describes as "trained with supervised fine-tuning instead of RLHF". Its performance was similar to that of GPT-3 (davinci), and it was consistently outperformed by the newer models.

Overall, the study found that the primary experimental factor causing consistent increases in accuracy was the underlying LLM being used. Newer models consistently outperformed older models, indicating the rapid advancements in LLM technology.
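As a rough illustration of how such a comparison might be run, the sketch below loops over models on a small multiple-choice question set and reports accuracy. The questions, model list, and grading rule are hypothetical placeholders, and `build_prompt` and `client` are the helpers from the earlier sketch; the paper's actual question bank and grading procedure are more involved.

```python
# Hypothetical multiple-choice questions with known answers; the paper's question
# bank is drawn from real tax-law exam-style problems.
QUESTIONS = [
    {"text": "Is a year-end cash bonus includible in gross income? (a) yes (b) no",
     "answer": "a"},
    {"text": "Is property received by inheritance includible in gross income? (a) yes (b) no",
     "answer": "b"},
]

# Chat-capable stand-ins for the models compared in the paper.
MODELS = ["gpt-4", "gpt-3.5-turbo"]


def grade(model_output: str, correct_option: str) -> bool:
    """Crude grading rule: look for the correct option letter in the model's answer."""
    text = model_output.strip().lower()
    return f"({correct_option})" in text or text.startswith(correct_option)


def evaluate(model: str) -> float:
    """Accuracy of one model over the question set, using the prompt builder above."""
    correct = 0
    for q in QUESTIONS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": build_prompt(q["text"])}],
            temperature=0,
        )
        if grade(resp.choices[0].message.content, q["answer"]):
            correct += 1
    return correct / len(QUESTIONS)


for model in MODELS:
    print(f"{model}: {evaluate(model):.0%} accuracy")
```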

Results

Senior tax attorneys need not worry about job security for the time being. However, junior tax associates should begin leveraging ChatGPT, or they run the risk of being left behind.

The results of the study provided several key insights into the capabilities of Large Language Models (LLMs) in the context of tax law.

Firstly, the study found that the effectiveness of the Chain-of-Thought (CoT) and few-shot prompting techniques varied depending on the specific LLM and the context. CoT prompting, which encourages the LLM to think through its response step-by-step, boosted the performance of the most advanced model, GPT-4, but did not consistently improve results across all models and retrieval methods. This suggests that an LLM might need a certain level of capability to benefit from additional reasoning. Few-shot prompting, which provides the LLM with three example question-answer pairs along with the question being asked, significantly improved results for GPT-4 but was less consistently useful for weaker models.

Secondly, the study found that providing the LLM with more legal text and more relevant legal text weakly increased accuracy for most models. This indicates that the quality and relevance of the legal texts used in the prompting process can influence the LLM's ability to generate accurate responses.

Finally, and perhaps most importantly, the study found that the primary experimental factor causing consistent increases in accuracy was the underlying LLM being used. Newer models consistently outperformed older models, demonstrating the rapid advancements in LLM technology and their increasing ability to reason about complex subjects like tax law.

These results highlight the potential of LLMs in the legal field, but also underscore the importance of ongoing research and development to further enhance their capabilities. The findings suggest that as LLMs continue to advance, they could play an increasingly significant role in legal services, potentially improving efficiency, reducing costs, and making legal advice more accessible.
