
Transcribing Audio with Whisper and Structuring Data with ChatGPT

0. Motivation

At work, we wanted to extract information from calls made through a CTI (computer telephony integration). This way, we can improve the service we offer by getting insights from each call and by providing suggestions to agents. We can do that using different OpenAI models.

1. How to extract insights from an audio with OpenAI

At the time of writing, GPT-4o cannot take audio as input through the API. That means we cannot pass an audio file together with text to extract insights in a single request. To do so, we need two steps:

  1. Transcribing the call with Whisper
  2. Extracting insights with ChatGPT

We will do it using the OpenAI Python package.

First, let’s focus on the transcription.

2. Using Whisper to transcribe the call

With Whisper, we can transcribe an audio file easily. It can be done with:

import io
import os

import boto3
from openai import OpenAI

s3 = boto3.client("s3")
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))


def get_file(s3_uri):
    # Extract bucket and key from the S3 URI
    bucket, filename = s3_uri.replace("s3://", "").split("/", 1)

    obj = s3.get_object(Bucket=bucket, Key=filename)
    file = io.BytesIO(obj["Body"].read())
    file.name = s3_uri  # Needed so that Whisper knows the extension (mp3 mostly)
    return file

def transcribe_audio(client, file):
    return client.audio.transcriptions.create(model="whisper-1", file=file)

file = get_file("s3://your_bucket/your/file/path.mp3")
transcription = transcribe_audio(client, file)

You will need to adapt the S3 URI and make sure the OPENAI_API_KEY environment variable is set. Alternatively, you can retrieve the secret from somewhere else.

This code assumes that the file is stored in S3 but it can easily be adapted for local files or for other cloud providers.
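For example, a minimal sketch for a local file could look like this (the path is a hypothetical placeholder):

def get_local_file(path):
    # The SDK accepts a plain binary file handle; the file name
    # tells Whisper the audio format (mp3, wav, etc.)
    return open(path, "rb")

file = get_local_file("path/to/your/file.mp3")  # hypothetical path
transcription = transcribe_audio(client, file)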

3. Handling invalid calls

When processing the first calls, I noticed that we had a lot of super short calls where no transcription could be extracted. Trying to transcribe those calls is a waste of resources.

To solve that, I used Mutagen to extract the audio length with:

from loguru import logger
from mutagen.mp3 import MP3, HeaderNotFoundError

def get_duration(file):
    """Get duration of the audio file using mutagen"""
    try:
        audio = MP3(file)
        duration_seconds = audio.info.length
    except HeaderNotFoundError:
        logger.error(f"Unable to read the duration of the file: {file.name}")
        duration_seconds = -1

    file.seek(0)  # Reset file pointer after reading
    return duration_seconds

Once the duration is extracted, we can transcribe only the audios longer than a minimum length:

import time

from loguru import logger

MIN_SECONDS = 5  # Adjust if needed


def process_file(client, filename):
    logger.info(f"Processing {filename=}")

    # Initialize so that both are defined even when we skip the transcription
    transcription = None
    trans_time = None

    t0 = time.monotonic()
    file = get_file(filename)
    duration_seconds = get_duration(file)

    if duration_seconds > MIN_SECONDS:
        transcription = transcribe_audio(client, file)
        trans_time = time.monotonic() - t0
        logger.info(f"Successfully transcribed {filename=} in {trans_time:.2f} seconds")
    else:
        logger.info(
            f"Skipping transcription for {filename=}, "
            f"{duration_seconds=:.2f} is less than {MIN_SECONDS=}"
        )

    return transcription

4. Adding metadata

I think it is always really useful to have some metadata whenever doing operations in the data lake. It will be especially useful when debugging issues.

That can easily be done with Pydantic. The idea is to create a model for each file we process.

Here is how you can define the model:

from datetime import datetime
from typing import Optional

from pydantic import BaseModel, Field

class Transcription(BaseModel):
    filename: str
    duration_seconds: float
    transcription: Optional[str] = None
    transcription_time: Optional[float] = None
    # default_factory ensures the timestamp is generated per instance,
    # not once when the class is defined
    transcribed_at: datetime = Field(default_factory=datetime.now)
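As a quick sanity check, instantiating the model with a hypothetical filename shows the defaults in action:

t = Transcription(filename="call_001.mp3", duration_seconds=42.0)
print(t.model_dump_json())
# Roughly: {"filename": "call_001.mp3", "duration_seconds": 42.0, "transcription": null, ...}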

With that, we can replace the final return statement of process_file with:

    return Transcription(
        filename=filename,
        duration_seconds=duration_seconds,
        # The Whisper response stores the text in the .text attribute
        transcription=transcription.text if transcription else None,
        transcription_time=trans_time,
    )

5. Using ChatGPT to Extract Insights

To extract information, you can make a standard call to ChatGPT. The key here is to leverage the recently introduced structured outputs (Introducing Structured Outputs in the API). For more details, refer to the API Reference | response_format and Structured Outputs documentation.

Note that there is an option to directly provide a Pydantic class in the API call, which will automatically handle the casting. However, since this feature is still in beta, I prefer not to use it at this time.
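For reference, here is a minimal sketch of that beta helper, assuming a recent version of the openai package; the CallAnalysis model it references is defined in the next section, and the rest of this post uses the plain chat completions endpoint instead:

# Sketch of the beta helper (not used in the rest of this post)
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Analyze the transcribed call."},
        {"role": "user", "content": "your transcription goes here"},
    ],
    response_format=CallAnalysis,  # the Pydantic model from the next section
)
call_analysis = completion.choices[0].message.parsed  # already a CallAnalysis instance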

5.1. Defining the Prompt with a Pydantic Model

To ensure consistency and avoid code duplication, we’ll start by creating a Pydantic class that defines the structure and fields of the output:

from typing import Literal, List, Optional
from pydantic import BaseModel, Field

class CallAnalysis(BaseModel):
    class Config:
        # This description is read later by get_json_schema_from_pydantic
        description = "A model to analyze transcribed calls, extracting key information about sentiment, outcomes, and potential improvements."
    
    sentiment: Literal['positive', 'negative'] = Field(
        ..., description="The overall sentiment of the call, either 'positive' or 'negative'."
    )
    summary: str = Field(
        ..., description="A brief summary of the call."
    )
    is_answered: bool = Field(
        ..., description="True if the call was answered, else false."
    )
    successful_sale: bool = Field(
        ..., description="True if the agent was able to sell a course, else false."
    )
    improvements: List[str] = Field(
        ..., description="Suggestions for improving the agent's performance on the call."
    )
    failure_reason: Optional[str] = Field(
        ..., description="The reason for the call's failure to result in a sale. Leave empty or as None if the sale was successful."
    )

With this Pydantic class, you can generate a prompt by using CallAnalysis.model_json_schema():

PROMPT = f"""You are an assistant tasked with analyzing a transcribed call.
Your response must be in English and return a JSON that is parsable with the following Pydantic model:

{CallAnalysis.model_json_schema()}
"""

While there are many ways to refine and optimize the prompt, those techniques are beyond the scope of this post.

5.2. Calling OpenAI

The key aspect here is correctly handling the response_format when interacting with OpenAI’s API. You can generate the required schema from the Pydantic model with the get_json_schema_from_pydantic helper defined below.

Afterward, you can obtain the output using CLIENT.chat.completions.create.

Finally, to capture relevant metadata about the API call, I extend the Pydantic class by adding some additional fields.

from datetime import datetime
import json
import os
from typing import Optional

from openai import OpenAI

CLIENT = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))


def get_json_schema_from_pydantic(pydantic_model):
    """
    To force ChatGPT to output json data.
    More info: https://platform.openai.com/docs/api-reference/chat/create#chat-create-response_format
    """
    schema = pydantic_model.model_json_schema()

    # Manually add "additionalProperties": false to the schema
    schema["additionalProperties"] = False

    json_schema = {
        "name": pydantic_model.__name__,
        "description": pydantic_model.Config.description,
        "schema": schema,
        "strict": True,
    }

    return {"type": "json_schema", "json_schema": json_schema}


def extract_info(transcription, pydantic_model=CallAnalysis):
    messages = [
        {"role": "system", "content": PROMPT},
        {"role": "user", "content": transcription},
    ]

    response_format = get_json_schema_from_pydantic(pydantic_model)

    chat_completion = CLIENT.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        response_format=response_format,
    )

    class ChatOutput(pydantic_model):
        """
        This class is an extension of the input `pydantic_model`.
        It adds metadata meant for debugging and issue resolution.
        """

        gpt_called_at: datetime = datetime.now()
        gpt_raw_input: str = transcription
        gpt_model: str = chat_completion.model
        id: Optional[str] = chat_completion.id  # Completion id, useful for tracing

    result = chat_completion.choices[0].message.content
    result_json = json.loads(result)

    return ChatOutput(**result_json)

extract_info("your transcription goes here")

Remember to pass the transcription from the previous step.
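Putting both steps together, the end-to-end flow for a single call looks roughly like this (the S3 URI is a placeholder):

# Transcribe one call and then analyze it
result = process_file(client, "s3://your_bucket/your/file/path.mp3")

if result.transcription is not None:  # Skipped calls have no transcription
    analysis = extract_info(result.transcription)
    print(analysis.summary, analysis.sentiment)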

6. Processing the outputs

So far we have seen how to process one file. For now, we can process multiple files with a simple for loop:

results = []
for file in files:
    result = process_file(client, file)
    results.append(result)

This example demonstrates the process for handling transcriptions. The extraction of insights can be achieved in a very similar manner.
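For instance, the insight-extraction pass could look like this, skipping the calls that were never transcribed:

# Analogous loop for the insight-extraction step
analyses = []
for result in results:
    if result.transcription is not None:
        analyses.append(extract_info(result.transcription))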

results will be a list of Transcription or ChatOutput pydantic models. We can easily transform that to a pandas dataframe with:

from datetime import datetime

import pandas as pd

df = pd.DataFrame([x.model_dump() for x in results])
df["p_extracted_at"] = datetime.now()

As usual, we add the p_extracted_at column as the partition column. You can read more about this in Self-Healing Pipelines.

6.1. Exporting the dataframe

Here we can use awswrangler to export the result as a table. This can be done with the following functions:

import awswrangler as wr
from loguru import logger

PARTITION_COL = "p_extracted_at"

def get_table_path(bucket_prefix, db_path, table):
    """Get the S3 path of the table"""
    buckets = [x for x in wr.s3.list_buckets() if x.startswith(bucket_prefix)]
    return f"s3://{buckets[0]}/{db_path}/{table}"

def write_parquet(
    df,
    database,
    table,
    bucket_prefix,
    mode="append",
    partition_col=PARTITION_COL,
    session=None,
):
    n_records = df.shape[0]

    # Assumes database names follow an '<env>__<db_path>' convention
    db_path = database.split("__")[1]
    path = get_table_path(bucket_prefix=bucket_prefix, db_path=db_path, table=table)

    logger.info(f"Writing {n_records} rows to table '{database}.{table}' ({path=})")

    kwargs = {}
    if session is not None:
        kwargs["boto3_session"] = session

    wr.s3.to_parquet(
        df=df,
        dataset=True,
        database=database,
        table=table,
        path=path,
        mode=mode,
        partition_cols=[partition_col],
        **kwargs,
    )
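As a usage sketch, with hypothetical names that follow the '<env>__<db_path>' convention assumed above:

write_parquet(
    df,
    database="prod__calls",
    table="transcriptions",
    bucket_prefix="my-datalake",
)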

7. Conclusion

In this post, we explored how to use ChatGPT to extract structured insights from transcriptions by leveraging Pydantic models and OpenAI’s API. This approach allows for clear and consistent outputs, making it easier to automate and analyze data.

In a future post, I’ll dive into how to handle asynchronous concurrent calls using a RateLimiter to ensure efficiency and compliance with API rate limits. Stay tuned for more advanced techniques to optimize your workflows with ChatGPT!