
Structured Data from LLMs — Langchain and Pydantic Output Parser


Understanding Pydantic Output Parser

The Pydantic output parser is a tool that allows users to define a JSON schema to query LLMs for outputs that adhere to that schema. This is pivotal for applications that require structured data, as it ensures outputs conform to predefined formats. The parser leverages Pydantic’s BaseModel for data validation and type checking, ensuring the data extracted is not only structurally sound but also type-accurate.

Extracting Structured Data from LLMs

After setting up the data model, Langchain’s output parser can be used to generate structured data. Here’s how it works:

  1. Define the query, instructing the LLM to analyze a block of code for security risks.
  2. Use Langchain’s PromptTemplate to format the query along with format instructions derived from the Pydantic parser.
  3. Query the LLM and parse the output using the PydanticOutputParser to get a structured JSON response.

Setting Up Pydantic with Langchain

First, to get started, make sure you have the langchain and openai pip packages installed. You can install both with pip install langchain openai.

To use this parser, you first define the data structure with Pydantic’s BaseModel. Here is an example that defines a model for identifying potential security risks in a code snippet:

from typing import List
# Use the pydantic v1 shim bundled with langchain so the model matches
# what PydanticOutputParser expects, even if pydantic v2 is installed.
from langchain.pydantic_v1 import BaseModel, Field, validator

class CodeRisk(BaseModel):
    description: str = Field(description="A brief description of the security risk")
    severity: str = Field(description="The level of severity of the risk")
    recommendations: List[str] = Field(description="Recommended actions to mitigate the risk")

    # Reject any severity value from the LLM that is outside the allowed set.
    @validator("severity")
    def severity_must_be_valid(cls, field):
        if field not in ["Low", "Medium", "High", "Critical"]:
            raise ValueError("Invalid severity level!")
        return field
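
The validator runs whenever a CodeRisk instance is created, so malformed LLM output fails fast instead of propagating through your application. As a quick illustration (the sample values here are made up):

# Valid data passes validation and becomes a typed object.
risk = CodeRisk(
    description="Hardcoded credentials in the authentication method.",
    severity="High",
    recommendations=["Avoid hardcoded credentials."],
)
print(risk.severity)  # "High"

# An unrecognized severity level raises a pydantic ValidationError.
CodeRisk(
    description="Example risk",
    severity="Catastrophic",
    recommendations=[],
)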

The CodeRisk model structures the output from the LLM: a brief description of the security risk, its severity level, and recommendations for mitigation.

Implementation Example: Identifying Security Risks in Code

After defining our CodeRisk data model using Pydantic, we can prepare the query and parse the output from the LLM. Here's how to set up the entire flow (add your OpenAI API key):

from langchain.llms import OpenAI
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate

model_name = "text-davinci-003"
temperature = 0.0
model = OpenAI(
    model_name=model_name,
    temperature=temperature,
    openai_api_key="ENTER YOUR KEY HERE",
)

code_snippet_query = """
Analyze the following block of code for potential security risks:
public void authenticate(String username, String password) {
    if (username.equals("admin") && password.equals("password123")) {
    // Authentication logic
    }
}
"""

# CodeRisk is the Pydantic model defined above.
parser = PydanticOutputParser(pydantic_object=CodeRisk)
prompt = PromptTemplate(
    template="Identify any potential security risks in the code snippet provided.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# Fill in the prompt, call the LLM, and parse its raw text response.
_input = prompt.format_prompt(query=code_snippet_query)
output = model(_input.to_string())

# parsed_output is a CodeRisk instance containing the structured data.
parsed_output = parser.parse(output)

With this setup, the LLM is directed to analyze the provided code and return any identified security risks in a structured format. Here is an example of what the output might look like:

{
  "description": "Hardcoded credentials found in the authentication method.",
  "severity": "High",
  "recommendations": [
    "Avoid using hardcoded credentials.",
    "Implement a more secure authentication method such as OAuth or JWT.",
    "Use environment variables or a secure vault for storing sensitive information."
  ]
}

This output is a JSON object conforming to the CodeRisk schema, providing a clear description of the identified risk, its severity, and actionable recommendations.
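
Because parsed_output is a CodeRisk instance rather than a raw string, its fields can be used directly in code. A quick sketch of what that might look like (the severity-gating policy here is hypothetical):

# Access typed fields directly on the parsed model.
print(parsed_output.description)
print(parsed_output.severity)

# For example, surface recommendations only for serious findings (hypothetical policy).
if parsed_output.severity in ("High", "Critical"):
    for recommendation in parsed_output.recommendations:
        print(f"- {recommendation}")

# Serialize back to JSON for storage or an API response.
print(parsed_output.json(indent=2))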

Use Cases and Impact

The applications of the Pydantic output parser within Langchain are vast. Here are some use cases:

  • Automated Code Reviews: Software engineers can automate preliminary security checks, extracting structured data about potential vulnerabilities.
  • Data Structuring: Convert unstructured data into JSON format, making it ready for databases or further processing (see the sketch after this list).
  • Decision Making: Businesses can structure customer feedback or market data for automated decision-making processes.
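
As a sketch of the data-structuring use case, the same pattern works with any schema: only the Pydantic model changes. The CustomerFeedback model and its field names below are hypothetical; the parser usage mirrors the security-risk example above:

from typing import List
from langchain.output_parsers import PydanticOutputParser
from langchain.pydantic_v1 import BaseModel, Field

# Hypothetical schema for turning free-form customer feedback into structured data.
class CustomerFeedback(BaseModel):
    sentiment: str = Field(description="Overall sentiment: Positive, Neutral, or Negative")
    topics: List[str] = Field(description="Product areas mentioned in the feedback")
    summary: str = Field(description="One-sentence summary of the feedback")

feedback_parser = PydanticOutputParser(pydantic_object=CustomerFeedback)
# feedback_parser.get_format_instructions() can then be injected into a prompt
# exactly as shown in the security-risk example above.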

This structured approach can be a game-changer in many domains, as it streamlines the process of gleaning actionable insights from vast amounts of natural language data.

Conclusion

By leveraging Langchain with Pydantic, developers and businesses alike can enhance their AI-powered applications, making them more robust and efficient. The Pydantic output parser is indeed a powerful tool that unlocks new potentials for structured data extraction from LLMs.

Follow me for future articles about LLMs and how to harness their potential.

Check out my YouTube video summarizer API SumVid to summarize a YouTube video of any length and answer questions about its content.

Also, check out my article on using OpenAI’s new JSON response type feature to force a JSON response.
