Transcribing Audio with Whisper and Structuring Data with ChatGPT
0. Motivation
At work we wanted to extract information from calls made through a CTI. This way we can improve the service we offer by getting insights from each call and by providing suggestions to agents. We can do that using different OpenAI models.
1. How to extract insights from an audio with OpenAI
At the time of writing, ChatGPT 4o is not able to process audio input through the API. That means we cannot pass an audio file together with text to extract insights in a single request. To do so, we need two steps:
- Transcribing the call with whisper
- Extracting insights with ChatGPT
We will do it using the OpenAI Python package.
First, let’s focus on the transcription.
2. Using Whisper to transcribe the call
With whisper we can transcribe an audio file easily. It can be done with:
import io
import os

import boto3
from openai import OpenAI

s3 = boto3.client("s3")
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def get_file(s3, s3_uri):
    # Extract bucket and key from the S3 URI
    bucket, filename = s3_uri.replace("s3://", "").split("/", 1)
    obj = s3.get_object(Bucket=bucket, Key=filename)
    file = io.BytesIO(obj["Body"].read())
    file.name = s3_uri  # Needed so that Whisper knows the extension (mp3 mostly)
    return file

def transcribe_audio(client, file):
    return client.audio.transcriptions.create(model="whisper-1", file=file)

file = get_file(s3, "s3://your_bucket/your/file/path.mp3")
transcription = transcribe_audio(client, file)
You will need to adapt the file path and make sure the OPENAI_API_KEY environment variable is set. Alternatively, you can retrieve the secret from somewhere else.
This code assumes that the file is stored in S3, but it can easily be adapted for local files or for other cloud providers.
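As a minimal sketch of such an adaptation, assuming the audio simply lives on disk (get_local_file is a hypothetical helper, not part of the code above), a local-file variant could look like this:

from pathlib import Path

def get_local_file(file_path):
    # A regular binary file handle keeps its name, so Whisper can infer the extension
    return Path(file_path).open("rb")

# transcription = transcribe_audio(client, get_local_file("call.mp3"))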
3. Handling invalid calls
When processing the first calls, I noticed that we had a lot of super short calls where no transcription could be extracted. Trying to transcribe those calls is a waste of resources.
To solve that, I used Mutagen to extract the audio length with:
from loguru import logger
from mutagen.mp3 import MP3, HeaderNotFoundError

def get_duration(file):
    """Get duration of the audio file using mutagen"""
    try:
        audio = MP3(file)
        duration_seconds = audio.info.length
    except HeaderNotFoundError:
        logger.error(f"Unable to read the duration of the file: {file.name}")
        duration_seconds = -1
    file.seek(0)  # Reset file pointer after reading
    return duration_seconds
Once the duration is extracted, I only transcribe the audio if it is longer than a minimum value:
import time

from loguru import logger

MIN_SECONDS = 5  # Adjust if needed

def process_file(client, s3, filename):
    logger.info(f"Processing {filename=}")
    t0 = time.monotonic()

    file = get_file(s3, filename)
    duration_seconds = get_duration(file)

    transcription = None
    trans_time = None
    if duration_seconds > MIN_SECONDS:
        transcription = transcribe_audio(client, file).text  # keep only the transcribed text
        trans_time = time.monotonic() - t0
        logger.info(f"Successfully transcribed {filename=} in {trans_time:.2f} seconds")
    else:
        logger.info(
            f"Skipping transcription for {filename=}, "
            f"{duration_seconds=:.2f} is less than {MIN_SECONDS=}"
        )
    return transcription
4. Adding metadata
I think it is always really useful to have some metadata whenever doing operations in the data lake. It will be especially useful when debugging issues. That can easily be done with Pydantic. The idea is to create a model for each file we process.
Here is how you can define the model:
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, Field

class Transcription(BaseModel):
    filename: str
    duration_seconds: float
    transcription: Optional[str] = None
    transcription_time: Optional[float] = None
    # default_factory so the timestamp is taken when each instance is created
    transcribed_at: datetime = Field(default_factory=datetime.now)
With that, we can add this to the process_file function:

def process_file(client, s3, filename):
    ...  # same logic as above
    return Transcription(
        filename=filename,
        duration_seconds=duration_seconds,
        transcription=transcription,
        transcription_time=trans_time,
    )
5. Using ChatGPT to Extract Insights
To extract information, you can make a standard call to ChatGPT.
The key here is to leverage the recently introduced structured outputs (Introducing Structured Outputs in the API).
For more details, refer to the API Reference | response_format and Structured Outputs documentation.
Note that there is an option to directly provide a Pydantic class in the API call, which will automatically handle the casting. However, since this feature is still in beta, I prefer not to use it at this time.
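For reference, a rough sketch of that beta helper is shown below. It uses the CallAnalysis model defined in the next section; treat it as an illustration rather than the approach followed in this post.

# Sketch of the beta structured-outputs helper (not used in the rest of this post)
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Analyze the transcribed call."},
        {"role": "user", "content": "your transcription goes here"},
    ],
    response_format=CallAnalysis,  # the Pydantic class is passed directly
)
call_analysis = completion.choices[0].message.parsed  # already a CallAnalysis instance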
5.1. Defining the Prompt with a Pydantic Model
To ensure consistency and avoid code duplication, we’ll start by creating a Pydantic class that defines the structure and fields of the output:
from typing import Literal, List, Optional

from pydantic import BaseModel, Field

class CallAnalysis(BaseModel):
    class Config:
        description = "A model to analyze transcribed calls, extracting key information about sentiment, outcomes, and potential improvements."

    sentiment: Literal["positive", "negative"] = Field(
        ..., description="The overall sentiment of the call, either 'positive' or 'negative'."
    )
    summary: str = Field(
        ..., description="A brief summary of the call."
    )
    is_answered: bool = Field(
        ..., description="True if the call was answered, else false."
    )
    successful_sale: bool = Field(
        ..., description="True if the agent was able to sell a course, else false."
    )
    improvements: List[str] = Field(
        ..., description="Suggestions for improving the agent's performance on the call."
    )
    failure_reason: Optional[str] = Field(
        ..., description="The reason for the call's failure to result in a sale. Leave empty or as None if the sale was successful."
    )
With this Pydantic class, you can generate a prompt by using CallAnalysis.model_json_schema():

PROMPT = f"""You are an assistant tasked with analyzing a transcribed call.
Your response must be in English and return a JSON that is parsable with the following Pydantic model:

```
{CallAnalysis.model_json_schema()}
```"""
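If you want to inspect the schema that drives the prompt, you can pretty-print it as a quick sanity check:

import json

print(json.dumps(CallAnalysis.model_json_schema(), indent=2))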
While there are many ways to refine and optimize the prompt, those techniques are beyond the scope of this post.
5.2. Calling OpenAI
The key aspect here is correctly handling the response_format when interacting with OpenAI’s API. You can generate the required schema using get_json_schema_from_pydantic based on the pydantic model. Afterward, you can obtain the output using CLIENT.chat.completions.create.
Finally, to capture relevant metadata about the API call, I extend the Pydantic class by adding some additional fields.
import json
import os
from datetime import datetime
from typing import Optional

from openai import OpenAI

CLIENT = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def get_json_schema_from_pydantic(pydantic_model):
    """
    To force ChatGPT to output json data.
    More info: https://platform.openai.com/docs/api-reference/chat/create#chat-create-response_format
    """
    schema = pydantic_model.model_json_schema()
    # Manually add "additionalProperties": false to the schema
    schema["additionalProperties"] = False

    json_schema = {
        "name": pydantic_model.__name__,
        "description": pydantic_model.Config.description,
        "schema": schema,
        "strict": True,
    }
    return {"type": "json_schema", "json_schema": json_schema}

def extract_info(transcription, pydantic_model=CallAnalysis):
    messages = [
        {"role": "system", "content": PROMPT},
        {"role": "user", "content": transcription},
    ]
    response_format = get_json_schema_from_pydantic(pydantic_model)

    chat_completion = CLIENT.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        response_format=response_format,
    )

    class ChatOutput(pydantic_model):
        """
        This class is an extension of the input `pydantic_model`.
        It adds multiple metadata fields meant for debugging and issue resolution.
        """
        gpt_called_at: datetime = datetime.now()
        gpt_raw_input: str = transcription
        gpt_model: str = chat_completion.model
        id: Optional[str] = chat_completion.id  # OpenAI request id

    result = chat_completion.choices[0].message.content
    result_json = json.loads(result)
    return ChatOutput(**result_json)

extract_info("your transcription goes here")
Remember to pass the transcription from the previous step.
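Putting both steps together, the glue code could look roughly like this (a sketch for a single file, using the functions defined above):

# Sketch: transcribe one call and then extract insights from it
result = process_file(client, s3, "s3://your_bucket/your/file/path.mp3")
if result.transcription is not None:
    analysis = extract_info(result.transcription)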
6. Processing the outputs
So far we have seen how to process one file. We can process multiple files with a simple for loop like:
results = []
for file in files:
    result = process_file(client, s3, file)
    results.append(result)
This example demonstrates the process for handling transcriptions. The extraction of insights can be achieved in a very similar manner.
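For instance, a sketch of the analogous loop for the insight-extraction step could be (assuming results holds the Transcription models from the loop above):

# Sketch: run the extraction step over the transcriptions gathered above
analyses = []
for t in results:
    if t.transcription is not None:  # skip calls that were too short to transcribe
        analyses.append(extract_info(t.transcription))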
results will be a list of Transcription or ChatOutput pydantic models. We can easily transform that to a pandas dataframe with:
import pandas as pd

df = pd.DataFrame([x.model_dump() for x in results])
df["p_extracted_at"] = datetime.now()
As usual, we add the p_extracted_at column as the partition column. You can read more about this in Self-Healing Pipelines.
6.1. Exporting the dataframe
Here we can use awswrangler to export the result as a table. This can be done with the following functions:
import awswrangler as wr
from loguru import logger

PARTITION_COL = "p_extracted_at"

def get_table_path(bucket_prefix, db_path, table):
    """Get the table path"""
    buckets = [x for x in wr.s3.list_buckets() if x.startswith(bucket_prefix)]
    return f"s3://{buckets[0]}/{db_path}/{table}"

def write_parquet(
    df,
    database,
    table,
    bucket_prefix,
    mode="append",
    partition_col=PARTITION_COL,
    session=None,
):
    n_records = df.shape[0]

    db_path = database.split("__")[1]  # assumes a '<prefix>__<db_path>' naming convention
    path = get_table_path(bucket_prefix=bucket_prefix, db_path=db_path, table=table)
    logger.info(f"Writing {n_records} rows to table '{database}.{table}' ({path=})")

    kwargs = {}
    if session is not None:
        kwargs["boto3_session"] = session

    wr.s3.to_parquet(
        df=df,
        dataset=True,
        database=database,
        table=table,
        path=path,
        mode=mode,
        partition_cols=[partition_col],
        **kwargs,
    )
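A minimal usage sketch could look like this (the database, table, and bucket_prefix values are placeholders, not names from a real setup):

# Hypothetical names, adjust to your own Glue database / S3 bucket layout
write_parquet(
    df=df,
    database="datalake__calls",
    table="transcriptions",
    bucket_prefix="my-datalake-",
)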
7. Conclusion
In this post, we explored how to use ChatGPT to extract structured insights from transcriptions by leveraging Pydantic models and OpenAI’s API. This approach allows for clear and consistent outputs, making it easier to automate and analyze data.
In a future post, I’ll dive into how to handle asynchronous concurrent calls using a RateLimiter to ensure efficiency and compliance with API rate limits.
Stay tuned for more advanced techniques to optimize your workflows with ChatGPT!