Segmenting text that has multiple languages

We’re trying to use the Assistants API for a module that needs two languages, say English and French. Given a paragraph that mixes English and French sentences or words, we want to split it into segments and feed each segment to a TTS API, so that the TTS engine knows exactly which language it is speaking.

We tried multiple prompts like the following:

When you respond with text containing mixed English and French, separate the segments based on language. For each segment, wrap it in markup Welcome</> and Bienvenue</>

Example:
Input: “Welcome to our website. Bienvenue sur notre site Web.”
Output: Welcome to our website.</> Bienvenue sur notre site Web.</>

However, it doesn’t segment the text well when the segments are short, even with gpt-4-0125-preview. For example, this input:

  1. Bonjour - Hello
  2. Merci - Thank you
  3. Oui - Yes
  4. Non - No
  5. Au revoir - Goodbye

just comes out as (all inside a French tag):

1. Bonjour - Hello 2. Merci - Thank you 3. Oui - Yes 4. Non - No 5. Au revoir - Goodbye </>

ChatGPT may not be the best tool for this. Does anyone know a better way?

You are a language segmentator. 
Your job is to take text input and annotate the language 
with the following schema:

{l: "EN"|"FR", t: string}[]

do not forget to include newlines.

here is your user input:
-----------------------------------------

Bonjour - Hello
Merci - Thank you
Oui - Yes
Non - No
Au revoir - Goodbye

Output:

[
  {"l": "FR", "t": "Bonjour"},
  {"l": "EN", "t": " - Hello\n"},
  {"l": "FR", "t": "Merci"},
  {"l": "EN", "t": " - Thank you\n"},
  {"l": "FR", "t": "Oui"},
  {"l": "EN", "t": " - Yes\n"},
  {"l": "FR", "t": "Non"},
  {"l": "EN", "t": " - No\n"},
  {"l": "FR", "t": "Au revoir"},
  {"l": "EN", "t": " - Goodbye"}
]
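
If it helps, here is a minimal sketch of how that JSON output could be consumed on the TTS side. It assumes the model replies with exactly the array shown above and nothing else. `speak` is a hypothetical stand-in for whatever TTS call you use, and the helper names are made up for illustration. The round-trip check is just a cheap way to catch the model dropping or rewriting text.

```python
import json
from typing import Callable, Dict, List, Optional

ALLOWED_LANGS = {"EN", "FR"}  # matches the schema in the prompt above


def parse_segments(raw: str, original: Optional[str] = None) -> List[Dict[str, str]]:
    """Parse the model's JSON reply and sanity-check each segment."""
    segments = json.loads(raw)
    if not isinstance(segments, list):
        raise ValueError("expected a JSON array of segments")
    for seg in segments:
        if (
            not isinstance(seg, dict)
            or seg.get("l") not in ALLOWED_LANGS
            or not isinstance(seg.get("t"), str)
        ):
            raise ValueError(f"bad segment: {seg!r}")
    # Optional check: the segments joined together should equal the input text.
    if original is not None and "".join(s["t"] for s in segments) != original:
        raise ValueError("segments do not reconstruct the original text")
    return segments


def merge_adjacent(segments: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Merge neighbouring segments that share a language (fewer TTS calls)."""
    merged: List[Dict[str, str]] = []
    for seg in segments:
        if merged and merged[-1]["l"] == seg["l"]:
            merged[-1]["t"] += seg["t"]
        else:
            merged.append(dict(seg))  # copy so the input list is not mutated
    return merged


def speak_all(segments: List[Dict[str, str]], speak: Callable[[str, str], None]) -> None:
    """Send each non-empty segment to a (hypothetical) TTS callback."""
    for seg in segments:
        # Drop surrounding spaces, separator dashes and newlines; adjust to taste.
        text = seg["t"].strip(" -\n")
        if text:
            speak(text, seg["l"])
```

Usage would look roughly like `speak_all(merge_adjacent(parse_segments(reply, user_text)), my_tts)`, where `my_tts(text, lang)` wraps your actual TTS request.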

:thinking:
