Comparing changes

base repository: Unstructured-IO/unstructured
base: 0.18.14
head repository: Unstructured-IO/unstructured
compare: 0.18.15
  • 5 commits
  • 20 files changed
  • 8 contributors

Commits on Sep 2, 2025

  1. Setup Codeflash Github Actions to optimize all future code (#4082)

    - This pull request sets up the `codeflash.yml` workflow, which runs on
    every new pull request that modifies source code in the `unstructured`
    directory.
    - We set up the Codeflash config in the `pyproject.toml` file. This defines
    the basic project config for Codeflash.
    - The workflow uses uv to install the CI dependencies faster than your
    current caching solution. Speed is useful to get quicker optimizations.
    - Please take a look at the requirements that are being installed. Feel
    free to add more to the install list. Codeflash tries to execute code,
    and if a dependency needed to run something is missing, it will
    fail to optimize.
    - Codeflash is installed every time in CI. This helps the
    workflow always use the latest version of Codeflash, as it improves
    rapidly. Feel free to add Codeflash as a dev dependency as well, since we
    are about to release more local optimization tools, such as VS Code and
    Claude Code extensions.
    - Feel free to modify this GitHub Action any way you want.
    
    **Actions required to make this work:**
    
    - Install the Codeflash GitHub app from [this
    link](https://github.com/apps/codeflash-ai/installations/select_target)
    on this repo. This is required for our GitHub bot to comment and create
    suggestions on the repo.
    - Create a new `CODEFLASH_API_KEY` after signing up to [Codeflash from
    our website](https://www.codeflash.ai/). The onboarding will ask you to
    create an API key and show instructions on how to save it in
    your repo secrets.
    
    Then, after this PR is merged, it will start generating new
    optimizations 🎉
    
    ---------
    
    Signed-off-by: Saurabh Misra <[email protected]>
    Co-authored-by: Aseem Saxena <[email protected]>
    Co-authored-by: cragwolfe <[email protected]>
    3 people authored Sep 2, 2025
    e3854d2 View commit details

Commits on Sep 9, 2025

  1. 1030a69 View commit details
  2. ⚡️ Speed up function group_broken_paragraphs by 30% (#4088)

    ### 📄 30% (0.30x) speedup for ***`group_broken_paragraphs` in
    `unstructured/cleaners/core.py`***
    
    ⏱️ Runtime : **`21.2 milliseconds`** **→** **`16.3 milliseconds`** (best
    of `66` runs)
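
As a rough check on the headline figure: the reported percentage appears to be the original runtime divided by the optimized runtime, minus one. That formula is an assumption inferred from the numbers in this report, not something the report states, and `speedup_percent` is a hypothetical helper written only for this illustration.

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Relative speedup in percent (assumed formula: original / optimized - 1)."""
    return (original / optimized - 1.0) * 100.0


# Runtimes quoted above, in milliseconds.
headline = speedup_percent(21.2, 16.3)
print(round(headline))  # 30
```

Under the same assumption, a negative result corresponds to the "slower" annotations that appear in the generated tests.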
    ### 📝 Explanation and details
    
    Here’s an optimized version of your code, preserving all function
    signatures, return values, and comments.
    **Key improvements:**  
    - **Precompile regexes** inside the functions where they are used
    repeatedly.
    - **Avoid repeated `.strip()` and `.split()`** calls in tight loops by
    working with stripped data directly.
    - **Reduce intermediate allocations** (like unnecessary list comps).
    - **Optimize `all_lines_short` computation** by short-circuiting
    iteration (`any` instead of `all` and negating logic).
    - Minimize calls to regex replace by using direct substitution when
    possible.
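
The first two bullets can be sketched as follows. This is a minimal illustration, not the code from the PR: `LINE_BREAK_PATTERN`, `LINE_BREAK_RE`, and both helper functions are hypothetical names, and the pattern is only assumed to resemble the library's line-break pattern.

```python
import re

# Hypothetical pattern; the real constants live in unstructured.nlp.patterns.
LINE_BREAK_PATTERN = r"\s*\n\s*"
LINE_BREAK_RE = re.compile(LINE_BREAK_PATTERN)  # compiled once at import time


def join_lines_uncompiled(paragraph: str) -> str:
    # strip() runs twice, and re.sub() looks the pattern string up in the
    # re module's internal cache on every call.
    if not paragraph.strip():
        return ""
    return re.sub(LINE_BREAK_PATTERN, " ", paragraph.strip())


def join_lines_precompiled(paragraph: str) -> str:
    # strip() runs once and the precompiled pattern object is used directly.
    stripped = paragraph.strip()
    if not stripped:
        return ""
    return LINE_BREAK_RE.sub(" ", stripped)
```

Both helpers return the same string; the second only removes per-call overhead, which matters mostly when the function runs many times on short inputs.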
    
    
    
    **Summary of key speedups:**
    - Precompiled regex references up front, so nothing is recompiled per call.
    - Reordered bullet-matching logic for an early fast-path `continue`.
    - Short-circuit `all_lines_short`: break on the first long line.
    - Avoids unnecessary double stripping/splitting.
    - Uses precompiled regexes even when constants may be strings.
    
    This version will be noticeably faster, especially for large documents
    or tight loops.
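
The short-circuit point above can be illustrated with a small sketch. The function names and the five-word threshold are assumptions made for this example (the threshold mirrors the "<5 words" comment in the generated tests); this is not the PR's implementation.

```python
from typing import Iterable, List


def all_lines_short_eager(lines: List[str], threshold: int = 5) -> bool:
    # Builds the complete list of booleans before all() inspects any of them.
    return all([len(line.split()) < threshold for line in lines])


def all_lines_short_lazy(lines: Iterable[str], threshold: int = 5) -> bool:
    # The generator lets any() stop at the first line that is too long.
    return not any(len(line.split()) >= threshold for line in lines)


short = ["Apache License", "Version 2.0, January 2004"]
mixed = ["Title", "This is a long line that should be grouped with the next."]
print(all_lines_short_lazy(short), all_lines_short_lazy(mixed))  # True False
```

The two variants agree on every input; the lazy one simply does less work when a long line appears early in a large paragraph.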
    
    
    ✅ **Correctness verification report:**
    
    | Test                        | Status            |
    | --------------------------- | ----------------- |
    | ⚙️ Existing Unit Tests | ✅ **58 Passed** |
    | 🌀 Generated Regression Tests | ✅ **49 Passed** |
    | ⏪ Replay Tests | ✅ **6 Passed** |
    | 🔎 Concolic Coverage Tests | 🔘 **None Found** |
    | 📊 Tests Coverage | 100.0% |
    <details>
    <summary>⚙️ Existing Unit Tests and Runtime</summary>
    
    | Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
    |:---|:---|:---|:---|
    | `cleaners/test_core.py::test_group_broken_paragraphs` | 19.5μs | 16.1μs | ✅21.0% |
    | `cleaners/test_core.py::test_group_broken_paragraphs_non_default_settings` | 23.9μs | 21.7μs | ✅10.2% |
    | `partition/test_text.py::test_partition_text_groups_broken_paragraphs` | 1.97ms | 1.96ms | ✅0.347% |
    | `test_tracer_py__replay_test_0.py::test_unstructured_cleaners_core_group_broken_paragraphs` | 161μs | 119μs | ✅34.9% |
    
    </details>
    
    <details>
    <summary>🌀 Generated Regression Tests and Runtime</summary>
    
    ```python
    from __future__ import annotations
    
    import re
    
    # imports
    import pytest  # used for our unit tests
    from unstructured.cleaners.core import group_broken_paragraphs
    
    # Dummy patterns for testing (since unstructured.nlp.patterns is unavailable)
    # These are simplified versions for the sake of testing
    DOUBLE_PARAGRAPH_PATTERN_RE = re.compile(r"\n\s*\n")
    E_BULLET_PATTERN = re.compile(r"^\s*e\s+", re.MULTILINE)
    PARAGRAPH_PATTERN = re.compile(r"\n")
    PARAGRAPH_PATTERN_RE = re.compile(r"\n")
    # Unicode bullets for test
    UNICODE_BULLETS_RE = re.compile(r"^\s*[•○·]", re.MULTILINE)
    
    # unit tests
    
    # -------------------- BASIC TEST CASES --------------------
    
    def test_empty_string():
        # Test that empty input returns empty string
        codeflash_output = group_broken_paragraphs('') # 1.38μs -> 2.69μs (48.7% slower)
    
    def test_single_line():
        # Test that a single line is returned unchanged
        codeflash_output = group_broken_paragraphs('Hello world.') # 6.58μs -> 6.83μs (3.68% slower)
    
    def test_two_paragraphs_with_double_newline():
        # Test that two paragraphs separated by double newline are preserved
        text = "First paragraph.\nSecond line.\n\nSecond paragraph.\nAnother line."
        expected = "First paragraph. Second line.\n\nSecond paragraph. Another line."
        codeflash_output = group_broken_paragraphs(text) # 13.7μs -> 14.2μs (3.07% slower)
    
    def test_paragraphs_with_single_line_breaks():
        # Test that lines in a paragraph are joined with spaces
        text = "The big red fox\nis walking down the lane.\n\nAt the end of the lane\nthe fox met a bear."
        expected = "The big red fox is walking down the lane.\n\nAt the end of the lane the fox met a bear."
        codeflash_output = group_broken_paragraphs(text) # 18.8μs -> 16.2μs (15.7% faster)
    
    def test_bullet_points():
        # Test bullet points are handled and line breaks inside bullets are joined
        text = "• The big red fox\nis walking down the lane.\n\n• At the end of the lane\nthe fox met a bear."
        expected = [
            "• The big red fox is walking down the lane.",
            "• At the end of the lane the fox met a bear."
        ]
        codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 33.4μs -> 19.7μs (69.7% faster)
    
    def test_e_bullet_points():
        # Test pytesseract e-bullet conversion is handled
        text = "e The big red fox\nis walking down the lane.\n\ne At the end of the lane\nthe fox met a bear."
        # e should be converted to ·
        expected = [
            "· The big red fox is walking down the lane.",
            "· At the end of the lane the fox met a bear."
        ]
        codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 27.8μs -> 16.9μs (64.3% faster)
    
    def test_short_lines_not_grouped():
        # Test that lines with <5 words are not grouped
        text = "Apache License\nVersion 2.0, January 2004\nhttp://www.apache.org/licenses/"
        expected = "Apache License\nVersion 2.0, January 2004\nhttp://www.apache.org/licenses/"
        codeflash_output = group_broken_paragraphs(text) # 10.5μs -> 11.5μs (8.37% slower)
    
    def test_mixed_bullet_and_normal():
        # Test that a mix of bullets and normal paragraphs works
        text = (
            "• First bullet\nis split\n\n"
            "A normal paragraph\nwith line break.\n\n"
            "• Second bullet\nis also split"
        )
        expected = [
            "• First bullet is split",
            "A normal paragraph with line break.",
            "• Second bullet is also split"
        ]
        codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 31.2μs -> 21.3μs (46.3% faster)
    
    # -------------------- EDGE TEST CASES --------------------
    
    def test_all_whitespace():
        # Test input of only whitespace returns empty string
        codeflash_output = group_broken_paragraphs('   \n   ') # 3.52μs -> 4.19μs (16.1% slower)
    
    def test_only_newlines():
        # Test input of only newlines returns empty string
        codeflash_output = group_broken_paragraphs('\n\n\n') # 2.44μs -> 3.46μs (29.7% slower)
    
    def test_single_bullet_with_no_linebreaks():
        # Test bullet point with no line breaks is preserved
        text = "• A bullet point with no line breaks."
        codeflash_output = group_broken_paragraphs(text) # 15.3μs -> 8.46μs (81.1% faster)
    
    def test_paragraph_with_multiple_consecutive_newlines():
        # Test that multiple consecutive newlines are treated as paragraph breaks
        text = "First para.\n\n\nSecond para.\n\n\n\nThird para."
        expected = "First para.\n\nSecond para.\n\nThird para."
        codeflash_output = group_broken_paragraphs(text) # 11.4μs -> 11.6μs (1.56% slower)
    
    def test_leading_and_trailing_newlines():
        # Test that leading and trailing newlines are ignored
        text = "\n\nFirst para.\nSecond line.\n\nSecond para.\n\n"
        expected = "First para. Second line.\n\nSecond para."
        codeflash_output = group_broken_paragraphs(text) # 11.9μs -> 12.5μs (4.58% slower)
    
    def test_bullet_point_with_leading_spaces():
        # Test bullet with leading whitespace is handled
        text = "   • Bullet with leading spaces\nand a line break."
        expected = "• Bullet with leading spaces and a line break."
        codeflash_output = group_broken_paragraphs(text) # 18.4μs -> 10.6μs (73.3% faster)
    
    def test_unicode_bullets():
        # Test that various unicode bullets are handled
        text = "○ Unicode bullet\nline two.\n\n· Another unicode bullet\nline two."
        expected = [
            "○ Unicode bullet line two.",
            "· Another unicode bullet line two."
        ]
        codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 27.7μs -> 15.7μs (75.8% faster)
    
    def test_short_lines_with_blank_lines():
        # Test that short lines with blank lines are preserved and not grouped
        text = "Title\n\nSubtitle\n\n2024"
        expected = "Title\n\nSubtitle\n\n2024"
        codeflash_output = group_broken_paragraphs(text) # 9.66μs -> 10.1μs (4.73% slower)
    
    def test_mixed_short_and_long_lines():
        # Test a paragraph with both short and long lines
        text = "Title\nThis is a long line that should be grouped with the next.\nAnother long line."
        expected = "Title This is a long line that should be grouped with the next. Another long line."
        codeflash_output = group_broken_paragraphs(text) # 14.9μs -> 13.2μs (13.3% faster)
    
    def test_bullet_point_with_inner_blank_lines():
        # Test bullet points with inner blank lines
        text = "• Bullet one\n\n• Bullet two\n\n• Bullet three"
        expected = [
            "• Bullet one",
            "• Bullet two",
            "• Bullet three"
        ]
        codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 24.9μs -> 13.7μs (81.4% faster)
    
    def test_paragraph_with_tabs_and_spaces():
        # Test paragraphs with tabs and spaces are grouped correctly
        text = "First\tparagraph\nis here.\n\n\tSecond paragraph\nis here."
        expected = "First\tparagraph is here.\n\n\tSecond paragraph is here."
        codeflash_output = group_broken_paragraphs(text) # 12.4μs -> 12.4μs (0.314% slower)
    
    # -------------------- LARGE SCALE TEST CASES --------------------
    
    def test_large_number_of_paragraphs():
        # Test function with 500 paragraphs
        paras = ["Paragraph {} line 1\nParagraph {} line 2".format(i, i) for i in range(500)]
        text = "\n\n".join(paras)
        expected = "\n\n".join(["Paragraph {} line 1 Paragraph {} line 2".format(i, i) for i in range(500)])
        codeflash_output = group_broken_paragraphs(text) # 1.79ms -> 1.69ms (5.66% faster)
    
    def test_large_number_of_bullets():
        # Test function with 500 bullet points, each split over two lines
        bullets = ["• Bullet {} part 1\nBullet {} part 2".format(i, i) for i in range(500)]
        text = "\n\n".join(bullets)
        expected = "\n\n".join(["• Bullet {} part 1 Bullet {} part 2".format(i, i) for i in range(500)])
        codeflash_output = group_broken_paragraphs(text) # 3.72ms -> 1.88ms (97.3% faster)
    
    def test_large_mixed_content():
        # Test function with 200 normal paragraphs and 200 bullet paragraphs
        paras = ["Normal para {} line 1\nNormal para {} line 2".format(i, i) for i in range(200)]
        bullets = ["• Bullet {} part 1\nBullet {} part 2".format(i, i) for i in range(200)]
        # Interleave them
        text = "\n\n".join([item for pair in zip(paras, bullets) for item in pair])
        # The input interleaves normal and bullet paragraphs, so expected does too
        expected = "\n\n".join([
            val for pair in zip(
                ["Normal para {} line 1 Normal para {} line 2".format(i, i) for i in range(200)],
                ["• Bullet {} part 1 Bullet {} part 2".format(i, i) for i in range(200)]
            ) for val in pair
        ])
        codeflash_output = group_broken_paragraphs(text) # 2.48ms -> 1.59ms (55.8% faster)
    
    def test_performance_on_large_text():
        # Test that the function can handle a large block of text efficiently (not a correctness test)
        big_text = "This is a line in a very big paragraph.\n" * 999
        # Should be grouped into a single paragraph with spaces
        expected = " ".join(["This is a line in a very big paragraph."] * 999)
        codeflash_output = group_broken_paragraphs(big_text) # 2.62ms -> 2.62ms (0.161% faster)
    # codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
    
    # -------------------- SECOND GENERATED TEST MODULE --------------------
    from __future__ import annotations
    
    import re
    
    # imports
    import pytest  # used for our unit tests
    from unstructured.cleaners.core import group_broken_paragraphs
    
    # Dummy regexes for test purposes (since we don't have unstructured.nlp.patterns)
    DOUBLE_PARAGRAPH_PATTERN_RE = re.compile(r"\n\s*\n")
    E_BULLET_PATTERN = re.compile(r"^e\s")
    PARAGRAPH_PATTERN = re.compile(r"\n")
    PARAGRAPH_PATTERN_RE = re.compile(r"\n")
    UNICODE_BULLETS_RE = re.compile(r"^[\u2022\u2023\u25E6\u2043\u2219\u25AA\u25CF\u25CB\u25A0\u25A1\u25B2\u25B3\u25BC\u25BD\u25C6\u25C7\u25C9\u25CB\u25D8\u25D9\u25E6\u2605\u2606\u2765\u2767\u29BE\u29BF\u25A0-\u25FF]")
    
    # unit tests
    
    # -------------------------------
    # 1. Basic Test Cases
    # -------------------------------
    
    def test_single_paragraph_joined():
        # Should join lines in a single paragraph into one line
        text = "The big red fox\nis walking down the lane."
        expected = "The big red fox is walking down the lane."
        codeflash_output = group_broken_paragraphs(text) # 11.2μs -> 9.78μs (14.9% faster)
    
    def test_multiple_paragraphs():
        # Should join lines in each paragraph, and keep paragraphs separate
        text = "The big red fox\nis walking down the lane.\n\nAt the end of the lane\nthe fox met a bear."
        expected = "The big red fox is walking down the lane.\n\nAt the end of the lane the fox met a bear."
        codeflash_output = group_broken_paragraphs(text) # 17.7μs -> 15.7μs (13.0% faster)
    
    def test_preserve_double_newlines():
        # Double newlines should be preserved as paragraph breaks
        text = "Para one line one\nPara one line two.\n\nPara two line one\nPara two line two."
        expected = "Para one line one Para one line two.\n\nPara two line one Para two line two."
        codeflash_output = group_broken_paragraphs(text) # 13.8μs -> 14.0μs (1.43% slower)
    
    def test_short_lines_not_joined():
        # Short lines (less than 5 words) should not be joined, but kept as separate lines
        text = "Apache License\nVersion 2.0, January 2004\nhttp://www.apache.org/licenses/"
        expected = "Apache License\nVersion 2.0, January 2004\nhttp://www.apache.org/licenses/"
        codeflash_output = group_broken_paragraphs(text) # 10.7μs -> 11.2μs (4.59% slower)
    
    def test_bullet_points_grouped():
        # Bullet points with line breaks should be joined into single lines per bullet
        text = "• The big red fox\nis walking down the lane.\n\n• At the end of the lane\nthe fox met a bear."
        expected = "• The big red fox is walking down the lane.\n\n• At the end of the lane the fox met a bear."
        codeflash_output = group_broken_paragraphs(text) # 35.4μs -> 21.1μs (68.0% faster)
    
    def test_e_bullet_points_grouped():
        # 'e' as bullet should be replaced and grouped
        text = "e The big red fox\nis walking down the lane."
        expected = "· The big red fox is walking down the lane."
        codeflash_output = group_broken_paragraphs(text) # 17.5μs -> 10.9μs (61.7% faster)
    
    # -------------------------------
    # 2. Edge Test Cases
    # -------------------------------
    
    def test_empty_string():
        # Empty string should return empty string
        codeflash_output = group_broken_paragraphs("") # 1.13μs -> 2.03μs (44.3% slower)
    
    def test_only_newlines():
        # String of only newlines should return empty string
        codeflash_output = group_broken_paragraphs("\n\n\n") # 2.70μs -> 3.52μs (23.1% slower)
    
    def test_spaces_and_newlines():
        # String of spaces and newlines should return empty string
        codeflash_output = group_broken_paragraphs("   \n  \n\n  ") # 2.91μs -> 3.90μs (25.4% slower)
    
    def test_single_word():
        # Single word should be returned as is
        codeflash_output = group_broken_paragraphs("Hello") # 5.77μs -> 6.09μs (5.24% slower)
    
    def test_single_line_paragraphs():
        # Multiple single-line paragraphs separated by double newlines
        text = "First para.\n\nSecond para.\n\nThird para."
        expected = "First para.\n\nSecond para.\n\nThird para."
        codeflash_output = group_broken_paragraphs(text) # 11.3μs -> 12.0μs (5.89% slower)
    
    def test_paragraph_with_trailing_newlines():
        # Paragraph with trailing newlines should be handled
        text = "The big red fox\nis walking down the lane.\n\n"
        expected = "The big red fox is walking down the lane."
        codeflash_output = group_broken_paragraphs(text) # 12.7μs -> 11.1μs (13.6% faster)
    
    def test_bullet_with_extra_spaces():
        # Bullet with extra spaces and newlines
        text = "  •   The quick brown\nfox jumps over\n  the lazy dog.  "
        expected = "•   The quick brown fox jumps over   the lazy dog.  "
        codeflash_output = group_broken_paragraphs(text) # 22.5μs -> 12.6μs (78.1% faster)
    
    def test_mixed_bullets_and_normal():
        # Mixed bullet and non-bullet paragraphs
        text = "• Bullet one\ncontinues here.\n\nNormal para\ncontinues here."
        expected = "• Bullet one continues here.\n\nNormal para continues here."
        codeflash_output = group_broken_paragraphs(text) # 22.0μs -> 15.6μs (40.8% faster)
    
    def test_multiple_bullet_styles():
        # Multiple Unicode bullet styles
        text = "• Bullet A\nline two.\n\n◦ Bullet B\nline two."
        expected = "• Bullet A line two.\n\n◦ Bullet B line two."
        codeflash_output = group_broken_paragraphs(text) # 23.7μs -> 12.4μs (90.4% faster)
    
    def test_short_and_long_lines_mixed():
        # A paragraph with both short and long lines
        text = "Short\nThis is a much longer line that should be joined\nAnother short"
        # Only the first and last lines are short, but the presence of a long line means the paragraph will be joined
        expected = "Short This is a much longer line that should be joined Another short"
        codeflash_output = group_broken_paragraphs(text) # 14.1μs -> 12.7μs (10.9% faster)
    
    def test_paragraph_with_tabs():
        # Paragraph with tabs instead of spaces
        text = "The big red fox\tis walking down the lane."
        expected = "The big red fox\tis walking down the lane."
        codeflash_output = group_broken_paragraphs(text) # 9.45μs -> 7.96μs (18.7% faster)
    
    def test_bullet_with_leading_newline():
        # Bullet point with a leading newline
        text = "\n• Bullet with leading newline\ncontinues here."
        expected = "• Bullet with leading newline continues here."
        codeflash_output = group_broken_paragraphs(text) # 18.7μs -> 9.98μs (87.2% faster)
    
    def test_bullet_with_trailing_newline():
        # Bullet point with a trailing newline
        text = "• Bullet with trailing newline\ncontinues here.\n"
        expected = "• Bullet with trailing newline continues here."
        codeflash_output = group_broken_paragraphs(text) # 17.2μs -> 9.58μs (79.6% faster)
    
    def test_unicode_bullet_variants():
        # Test with a variety of Unicode bullets
        text = "● Unicode bullet one\ncontinues\n\n○ Unicode bullet two\ncontinues"
        expected = "● Unicode bullet one continues\n\n○ Unicode bullet two continues"
        codeflash_output = group_broken_paragraphs(text) # 24.3μs -> 13.8μs (76.7% faster)
    
    def test_multiple_empty_paragraphs():
        # Multiple empty paragraphs between text
        text = "First para.\n\n\n\nSecond para."
        expected = "First para.\n\nSecond para."
        codeflash_output = group_broken_paragraphs(text) # 9.26μs -> 9.85μs (6.00% slower)
    
    # -------------------------------
    # 3. Large Scale Test Cases
    # -------------------------------
    
    def test_large_number_of_paragraphs():
        # 500 paragraphs, each with two lines to be joined
        paras = ["Line one {}\nLine two {}".format(i, i) for i in range(500)]
        text = "\n\n".join(paras)
        expected = "\n\n".join(["Line one {} Line two {}".format(i, i) for i in range(500)])
        codeflash_output = group_broken_paragraphs(text) # 1.36ms -> 1.29ms (5.79% faster)
    
    def test_large_number_of_bullets():
        # 300 bullet points, each with two lines
        paras = ["• Bullet {}\ncontinues here.".format(i) for i in range(300)]
        text = "\n\n".join(paras)
        expected = "\n\n".join(["• Bullet {} continues here.".format(i) for i in range(300)])
        codeflash_output = group_broken_paragraphs(text) # 1.98ms -> 969μs (104% faster)
    
    def test_large_mixed_content():
        # Mix of 200 normal paras and 200 bullets
        normal_paras = ["Normal {}\ncontinues".format(i) for i in range(200)]
        bullet_paras = ["• Bullet {}\ncontinues".format(i) for i in range(200)]
        all_paras = []
        for i in range(200):
            all_paras.append(normal_paras[i])
            all_paras.append(bullet_paras[i])
        text = "\n\n".join(all_paras)
        # The function processes paragraphs in order, so interleave the expected output
        interleaved = []
        for i in range(200):
            interleaved.append("Normal {} continues".format(i))
            interleaved.append("• Bullet {} continues".format(i))
        expected = "\n\n".join(interleaved)
        codeflash_output = group_broken_paragraphs(text)
    
    def test_large_short_lines():
        # 1000 short lines, all should be preserved as is (not joined)
        text = "\n".join(["A {}".format(i) for i in range(1000)])
        expected = "\n".join(["A {}".format(i) for i in range(1000)])
        codeflash_output = group_broken_paragraphs(text) # 605μs -> 565μs (7.11% faster)
    
    def test_large_paragraph_with_long_lines():
        # One paragraph with 1000 long lines (should be joined into one)
        text = "\n".join(["This is a long line number {}".format(i) for i in range(1000)])
        expected = " ".join(["This is a long line number {}".format(i) for i in range(1000)])
        codeflash_output = group_broken_paragraphs(text) # 2.11ms -> 2.09ms (1.10% faster)
    # codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
    ```
    
    </details>
    
    
    To edit these changes `git checkout
    codeflash/optimize-group_broken_paragraphs-mcg8s57e` and push.
    
    
    ---------
    
    Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
    Co-authored-by: Saurabh Misra <[email protected]>
    Co-authored-by: qued <[email protected]>
    Co-authored-by: Alan Bertl <[email protected]>
    5 people authored Sep 9, 2025
    6aee131 View commit details

Commits on Sep 10, 2025

  1. ⚡️ Speed up method ElementHtml._get_children_html by 234% (#4087)

    ### 📄 234% (2.34x) speedup for ***`ElementHtml._get_children_html` in
    `unstructured/partition/html/convert.py`***
    
    ⏱️ Runtime : **`12.3 milliseconds`** **→** **`3.69 milliseconds`** (best
    of `101` runs)
    ### 📝 Explanation and details
    
    Here is a **faster rewrite** of your program, based on your line
    profiling results, the imported code constraints, and the code logic.
    
    ### Key optimizations
    
    - **Avoid repeated parsing:** The hotspot is the recursive call to
    `child.get_html_element(**kwargs)`, which creates a new
    `BeautifulSoup` object on every call.
    Solution: **pass down and reuse a single `BeautifulSoup` instance** when
    building child HTML elements.
    - **Minimize object creation:** Create `soup` once at the *topmost* call
    and reuse it for all children and subchildren.
    - **Reduce `.get_text_as_html` use:** Only use the soup
    instance when really necessary and avoid repeated blank parses.
    - **Avoid double wrapping:** Only allocate wrappers and new tags if
    absolutely required.
    - **General micro-optimizations:** Use `None` instead of `or []`,
    fast-path checks on empty children, etc.
    - **Preserve all comments and signatures as specified.**
    
    Below is the optimized version.
    
    
    
    ### Explanation of improvements
    
    - **Soup passing**: The `get_html_element` method now optionally
    receives a `_soup` kwarg. At the top of the tree, it is `None`, so a new
    one is created. Then, for all descendants, the same `soup` instance is
    passed via `_soup`, avoiding repeated parsing and allocation.
    - **Children check**: `self.children` is checked once, and the attribute
    itself is kept as a list (not `or`-ed with an empty list on every call).
    - **No unnecessary soup parsing**: `get_text_as_html()` doesn't need a
    soup argument, since it only returns a Tag (from the parent module).
    - **No changes to existing comments, new comments added only where logic
    was changed.**
    - **Behavior (output and signature) preserved.**
    
    This **avoids creating thousands of BeautifulSoup objects recursively**,
    which was the primary bottleneck found in the profiler. The result is
    vastly improved performance, especially for large/complex trees.
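
The soup-passing idea can be sketched with the standard library alone. In this illustration `TagFactory` is a hypothetical stand-in for an expensive object such as `BeautifulSoup("", "html.parser")`, and `Node.to_html` is a hypothetical stand-in for `get_html_element`; neither is the PR's code. The optional `_factory` argument plays the role that `_soup` is described as playing above.

```python
from typing import List, Optional


class TagFactory:
    """Stand-in for an object that is costly to construct."""

    created = 0  # counts how many factories have been built

    def __init__(self) -> None:
        TagFactory.created += 1

    def open_tag(self, name: str, text: str) -> str:
        return "<{}>{}".format(name, text)


class Node:
    def __init__(self, text: str, children: Optional[List["Node"]] = None) -> None:
        self.text = text
        self.children = children if children is not None else []

    def to_html(self, _factory: Optional[TagFactory] = None) -> str:
        # Build the factory only at the top-level call; descendants reuse it.
        factory = _factory if _factory is not None else TagFactory()
        inner = "".join(child.to_html(_factory=factory) for child in self.children)
        return factory.open_tag("div", self.text) + inner + "</div>"


tree = Node("root", [Node("a", [Node("b")]), Node("c")])
html = tree.to_html()
print(TagFactory.created)  # 1
```

However deep the tree is, only the outermost call constructs a factory, which is the effect the explanation above attributes to passing `_soup` down the recursion.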
    
    
    ✅ **Correctness verification report:**
    
    | Test                        | Status            |
    | --------------------------- | ----------------- |
    | ⚙️ Existing Unit Tests | 🔘 **None Found** |
    | 🌀 Generated Regression Tests | ✅ **768 Passed** |
    | ⏪ Replay Tests | ✅ **1 Passed** |
    | 🔎 Concolic Coverage Tests | 🔘 **None Found** |
    | 📊 Tests Coverage | 100.0% |
    <details>
    <summary>🌀 Generated Regression Tests and Runtime</summary>
    
    ```python
    from abc import ABC
    from typing import Any, List, Optional, Union
    
    # imports
    import pytest  # used for our unit tests
    from bs4 import BeautifulSoup, Tag
    from unstructured.partition.html.convert import ElementHtml
    
    # --- Minimal stubs for dependencies ---
    
    class Metadata:
        def __init__(self, text_as_html: Optional[str] = None):
            self.text_as_html = text_as_html
    
    class Element:
        def __init__(self, text="", category="default", id="0", metadata=None):
            self.text = text
            self.category = category
            self.id = id
            self.metadata = metadata or Metadata()
    
    # --- The function and class under test ---
    
    HTML_PARSER = "html.parser"
    
    # --- Test helpers ---
    
    class DummyElementHtml(ElementHtml):
        """A concrete subclass for testing, with optional custom tag."""
        def __init__(self, element, children=None, html_tag="div"):
            super().__init__(element, children)
            self._html_tag = html_tag
    
    # --- Unit tests for _get_children_html ---
    
    @pytest.fixture
    def soup():
        # Fixture for a BeautifulSoup object
        return BeautifulSoup("", HTML_PARSER)
    
    def make_tag(soup, name, text=None, **attrs):
        tag = soup.new_tag(name)
        if text:
            tag.string = text
        for k, v in attrs.items():
            tag[k] = v
        return tag
    
    # 1. BASIC TEST CASES
    
    def test_single_child_basic(soup):
        """Single child: Should wrap parent and child in a div, in order."""
        parent_el = Element("Parent", category="parent", id="p1")
        child_el = Element("Child", category="child", id="c1")
        child = DummyElementHtml(child_el)
        parent = DummyElementHtml(parent_el, children=[child])
        # Prepare the parent tag
        parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
        # Call _get_children_html
        codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
        divs = result.find_all("div", recursive=False)
    
    def test_multiple_children_basic(soup):
        """Multiple children: All children should be appended in order."""
        parent_el = Element("Parent", category="parent", id="p1")
        child1_el = Element("Child1", category="child", id="c1")
        child2_el = Element("Child2", category="child", id="c2")
        child1 = DummyElementHtml(child1_el)
        child2 = DummyElementHtml(child2_el)
        parent = DummyElementHtml(parent_el, children=[child1, child2])
        parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
        codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
        divs = result.find_all("div", recursive=False)
    
    def test_no_children_returns_parent_wrapped(soup):
        """No children: Should still wrap parent in a div."""
        parent_el = Element("Parent", category="parent", id="p1")
        parent = DummyElementHtml(parent_el, children=[])
        parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
        codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
        inner_divs = result.find_all("div", recursive=False)
    
    def test_children_with_different_tags(soup):
        """Children with different HTML tags should be preserved."""
        parent_el = Element("Parent", category="parent", id="p1")
        child_el = Element("Child", category="child", id="c1")
        child = DummyElementHtml(child_el, html_tag="span")
        parent = DummyElementHtml(parent_el, children=[child])
        parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
        codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
    
    # 2. EDGE TEST CASES
    
    def test_empty_element_text_and_children(soup):
        """Parent and children have empty text."""
        parent_el = Element("", category="parent", id="p1")
        child_el = Element("", category="child", id="c1")
        child = DummyElementHtml(child_el)
        parent = DummyElementHtml(parent_el, children=[child])
        parent_tag = make_tag(soup, "div", "", **{"class": "parent", "id": "p1"})
        codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
        divs = result.find_all("div", recursive=False)
    
    def test_deeply_nested_children(soup):
        """Test with deep nesting (e.g., 5 levels)."""
        # Build a chain, outermost first: c5 -> c4 -> c3 -> c2 -> c1 -> root
        el = Element("root", category="cat0", id="id0")
        node = DummyElementHtml(el)
        for i in range(1, 6):
            el = Element(f"c{i}", category=f"cat{i}", id=f"id{i}")
            node = DummyElementHtml(el, children=[node])
        # At the top, node is the outermost parent
        parent_tag = make_tag(soup, "div", "c5", **{"class": "cat5", "id": "id5"})
        codeflash_output = node._get_children_html(soup, parent_tag); result = codeflash_output
        # Should have one child at each level
        current = result
        for i in range(6):
            divs = [c for c in current.contents if isinstance(c, Tag)]
            current = divs[0]
    
    def test_html_injection_in_text(soup):
        """Child text that looks like HTML should be escaped, not parsed as HTML."""
        parent_el = Element("Parent", category="parent", id="p1")
        child_el = Element("<b>bold</b>", category="child", id="c1")
        child = DummyElementHtml(child_el)
        parent = DummyElementHtml(parent_el, children=[child])
        parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
        codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
        # The child div should have literal text, not a <b> tag inside
        child_div = result.find_all("div", recursive=False)[1]
    
    def test_children_with_duplicate_ids(soup):
        """Multiple children with the same id."""
        parent_el = Element("Parent", category="parent", id="p1")
        child1_el = Element("Child1", category="child", id="dup")
        child2_el = Element("Child2", category="child", id="dup")
        child1 = DummyElementHtml(child1_el)
        child2 = DummyElementHtml(child2_el)
        parent = DummyElementHtml(parent_el, children=[child1, child2])
        parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
        codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
        # Both children should be present, even with duplicate ids
        divs = result.find_all("div", recursive=False)
    
    def test_children_with_none(soup):
        """Children list contains None (should ignore or raise)."""
        parent_el = Element("Parent", category="parent", id="p1")
        child_el = Element("Child", category="child", id="c1")
        child = DummyElementHtml(child_el)
        parent = DummyElementHtml(parent_el, children=[child, None])
        parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
        # Should raise AttributeError when trying to call get_html_element on None
        with pytest.raises(AttributeError):
            parent._get_children_html(soup, parent_tag)
    
    # 3. LARGE SCALE TEST CASES
    
    def test_many_children_performance(soup):
        """Test with 500 children: structure and order."""
        parent_el = Element("Parent", category="parent", id="p1")
        children = [DummyElementHtml(Element(f"Child{i}", category="child", id=f"c{i}")) for i in range(500)]
        parent = DummyElementHtml(parent_el, children=children)
        parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
        codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
        divs = result.find_all("div", recursive=False)
    
    def test_large_tree_width_and_depth(soup):
        """Test with a tree of width 10 and depth 3 (total 1 + 10 + 100 = 111 nodes)."""
        def make_tree(depth, width):
            if depth == 0:
                return []
            return [
                DummyElementHtml(
                    Element(f"Child{depth}_{i}", category="cat", id=f"id{depth}_{i}"),
                    children=make_tree(depth-1, width)
                )
                for i in range(width)
            ]
        parent_el = Element("Root", category="root", id="root")
        children = make_tree(2, 10)  # depth=2, width=10 at each node
        parent = DummyElementHtml(parent_el, children=children)
        parent_tag = make_tag(soup, "div", "Root", **{"class": "root", "id": "root"})
        codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
        # The first level should have 1 parent + 10 children
        divs = result.find_all("div", recursive=False)
        # Each child should have its own children (10 each)
        for child_div in divs[1:]:
            sub_divs = child_div.find_all("div", recursive=False)
    
    def test_large_text_content(soup):
        """Test with a single child with a very large text string."""
        large_text = "A" * 10000
        parent_el = Element("Parent", category="parent", id="p1")
        child_el = Element(large_text, category="child", id="c1")
        child = DummyElementHtml(child_el)
        parent = DummyElementHtml(parent_el, children=[child])
        parent_tag = make_tag(soup, "div", "Parent", **{"class": "parent", "id": "p1"})
        codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
        # The child div should contain the large text exactly
        child_div = result.find_all("div", recursive=False)[1]
    
    def test_children_with_varied_tags_and_attributes(soup):
        """Test children with different tags and extra attributes."""
        parent_el = Element("P", category="parent", id="p")
        child1_el = Element("C1", category="c1", id="c1")
        child2_el = Element("C2", category="c2", id="c2")
        child1 = DummyElementHtml(child1_el, html_tag="section")
        child2 = DummyElementHtml(child2_el, html_tag="article")
        parent = DummyElementHtml(parent_el, children=[child1, child2])
        parent_tag = make_tag(soup, "header", "P", **{"class": "parent", "id": "p"})
        codeflash_output = parent._get_children_html(soup, parent_tag); result = codeflash_output
    # codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
    
    from abc import ABC
    from typing import Any, Optional, Union
    
    # imports
    import pytest  # used for our unit tests
    from bs4 import BeautifulSoup, Tag
    from unstructured.partition.html.convert import ElementHtml
    
    
    # Minimal stub for Element and its metadata
    class Metadata:
        def __init__(self, text_as_html=None):
            self.text_as_html = text_as_html
    
    class Element:
        def __init__(self, text="", category=None, id=None, metadata=None):
            self.text = text
            self.category = category or "default-category"
            self.id = id or "default-id"
            self.metadata = metadata or Metadata()
    
    HTML_PARSER = "html.parser"
    
    # ---------------------------
    # Unit tests for _get_children_html
    # ---------------------------
    
    # Helper subclass to expose _get_children_html for testing
    class TestElementHtml(ElementHtml):
        def public_get_children_html(self, soup, element_html, **kwargs):
            return self._get_children_html(soup, element_html, **kwargs)
    
        # Override get_html_element to avoid recursion issues in tests
        def get_html_element(self, **kwargs: Any) -> Tag:
            soup = BeautifulSoup("", HTML_PARSER)
            element_html = self.get_text_as_html()
            if element_html is None:
                element_html = soup.new_tag(name=self.html_tag)
                self._inject_html_element_content(element_html, **kwargs)
            element_html["class"] = self.element.category
            element_html["id"] = self.element.id
            self._inject_html_element_attrs(element_html)
            if self.children:
                return self._get_children_html(soup, element_html, **kwargs)
            return element_html
    
    # ---- BASIC TEST CASES ----
    
    def test_single_child_basic():
        # Test with one parent and one child
        parent_elem = Element(text="Parent", category="parent-cat", id="parent-id")
        child_elem = Element(text="Child", category="child-cat", id="child-id")
        child = TestElementHtml(child_elem)
        parent = TestElementHtml(parent_elem, children=[child])
    
        soup = BeautifulSoup("", HTML_PARSER)
        parent_html = soup.new_tag("div")
        parent_html.string = parent_elem.text
        parent_html["class"] = parent_elem.category
        parent_html["id"] = parent_elem.id
    
        # Call the function
        result = parent.public_get_children_html(soup, parent_html)
    
    def test_multiple_children_basic():
        # Parent with two children
        parent_elem = Element(text="P", category="p-cat", id="p-id")
        child1_elem = Element(text="C1", category="c1-cat", id="c1-id")
        child2_elem = Element(text="C2", category="c2-cat", id="c2-id")
        child1 = TestElementHtml(child1_elem)
        child2 = TestElementHtml(child2_elem)
        parent = TestElementHtml(parent_elem, children=[child1, child2])
    
        soup = BeautifulSoup("", HTML_PARSER)
        parent_html = soup.new_tag("div")
        parent_html.string = parent_elem.text
        parent_html["class"] = parent_elem.category
        parent_html["id"] = parent_elem.id
    
        result = parent.public_get_children_html(soup, parent_html)
    
    def test_no_children_returns_wrapper_with_only_parent():
        # Parent with no children, should still wrap parent_html in a div
        parent_elem = Element(text="Solo", category="solo-cat", id="solo-id")
        parent = TestElementHtml(parent_elem, children=[])
    
        soup = BeautifulSoup("", HTML_PARSER)
        parent_html = soup.new_tag("div")
        parent_html.string = parent_elem.text
        parent_html["class"] = parent_elem.category
        parent_html["id"] = parent_elem.id
    
        result = parent.public_get_children_html(soup, parent_html)
    
    def test_children_are_nested():
        # Test with a deeper hierarchy: parent -> child -> grandchild
        grandchild_elem = Element(text="GC", category="gc-cat", id="gc-id")
        grandchild = TestElementHtml(grandchild_elem)
        child_elem = Element(text="C", category="c-cat", id="c-id")
        child = TestElementHtml(child_elem, children=[grandchild])
        parent_elem = Element(text="P", category="p-cat", id="p-id")
        parent = TestElementHtml(parent_elem, children=[child])
    
        soup = BeautifulSoup("", HTML_PARSER)
        parent_html = soup.new_tag("div")
        parent_html.string = parent_elem.text
        parent_html["class"] = parent_elem.category
        parent_html["id"] = parent_elem.id
    
        result = parent.public_get_children_html(soup, parent_html)
        child_div = result.contents[1]
        grandchild_div = child_div.contents[1]
    
    # ---- EDGE TEST CASES ----
    
    def test_empty_text_and_attributes():
        # Parent and child with empty text and missing attributes
        parent_elem = Element(text="", category="", id="")
        child_elem = Element(text="", category="", id="")
        child = TestElementHtml(child_elem)
        parent = TestElementHtml(parent_elem, children=[child])
    
        soup = BeautifulSoup("", HTML_PARSER)
        parent_html = soup.new_tag("div")
        parent_html.string = parent_elem.text
        parent_html["class"] = parent_elem.category
        parent_html["id"] = parent_elem.id
    
        result = parent.public_get_children_html(soup, parent_html)
    
    def test_child_with_html_content():
        # Child with HTML in text_as_html, should parse as HTML element
        child_elem = Element(text="ignored", category="cat", id="cid",
                             metadata=Metadata(text_as_html="<span>HTMLChild</span>"))
        child = TestElementHtml(child_elem)
        parent_elem = Element(text="Parent", category="pcat", id="pid")
        parent = TestElementHtml(parent_elem, children=[child])
    
        soup = BeautifulSoup("", HTML_PARSER)
        parent_html = soup.new_tag("div")
        parent_html.string = parent_elem.text
        parent_html["class"] = parent_elem.category
        parent_html["id"] = parent_elem.id
    
        result = parent.public_get_children_html(soup, parent_html)
        child_html = result.contents[1]
    
    def test_parent_with_html_content_and_children():
        # Parent with HTML in text_as_html, children as normal
        parent_elem = Element(text="ignored", category="pcat", id="pid",
                              metadata=Metadata(text_as_html="<h1>Header</h1>"))
        child_elem = Element(text="Child", category="ccat", id="cid")
        child = TestElementHtml(child_elem)
        parent = TestElementHtml(parent_elem, children=[child])
    
        soup = BeautifulSoup("", HTML_PARSER)
        parent_html = parent.get_text_as_html()
        parent_html["class"] = parent_elem.category
        parent_html["id"] = parent_elem.id
    
        result = parent.public_get_children_html(soup, parent_html)
    
    def test_children_with_duplicate_ids():
        # Children with the same id, should not raise errors, but both ids should be present
        child_elem1 = Element(text="A", category="cat", id="dup")
        child_elem2 = Element(text="B", category="cat", id="dup")
        child1 = TestElementHtml(child_elem1)
        child2 = TestElementHtml(child_elem2)
        parent_elem = Element(text="P", category="pcat", id="pid")
        parent = TestElementHtml(parent_elem, children=[child1, child2])
    
        soup = BeautifulSoup("", HTML_PARSER)
        parent_html = soup.new_tag("div")
        parent_html.string = parent_elem.text
        parent_html["class"] = parent_elem.category
        parent_html["id"] = parent_elem.id
    
        result = parent.public_get_children_html(soup, parent_html)
    
    def test_children_with_various_html_tags():
        # Children with different html_tag settings
        class CustomElementHtml(TestElementHtml):
            _html_tag = "section"
    
        child_elem = Element(text="Sec", category="cat", id="cid")
        child = CustomElementHtml(child_elem)
        parent_elem = Element(text="P", category="pcat", id="pid")
        parent = TestElementHtml(parent_elem, children=[child])
    
        soup = BeautifulSoup("", HTML_PARSER)
        parent_html = soup.new_tag("div")
        parent_html.string = parent_elem.text
        parent_html["class"] = parent_elem.category
        parent_html["id"] = parent_elem.id
    
        result = parent.public_get_children_html(soup, parent_html)
    
    def test_html_tag_property_override():
        # Test that html_tag property is respected
        class CustomElementHtml(TestElementHtml):
            @property
            def html_tag(self):
                return "article"
    
        child_elem = Element(text="Art", category="cat", id="cid")
        child = CustomElementHtml(child_elem)
        parent_elem = Element(text="P", category="pcat", id="pid")
        parent = TestElementHtml(parent_elem, children=[child])
    
        soup = BeautifulSoup("", HTML_PARSER)
        parent_html = soup.new_tag("div")
        parent_html.string = parent_elem.text
        parent_html["class"] = parent_elem.category
        parent_html["id"] = parent_elem.id
    
        result = parent.public_get_children_html(soup, parent_html)
    
    def test_inject_html_element_attrs_is_called():
        # Test that _inject_html_element_attrs is called (by side effect)
        class AttrElementHtml(TestElementHtml):
            def _inject_html_element_attrs(self, element_html: Tag) -> None:
                element_html["data-test"] = "called"
    
        child_elem = Element(text="Child", category="cat", id="cid")
        child = AttrElementHtml(child_elem)
        parent_elem = Element(text="P", category="pcat", id="pid")
        parent = AttrElementHtml(parent_elem, children=[child])
    
        soup = BeautifulSoup("", HTML_PARSER)
        parent_html = soup.new_tag("div")
        parent_html.string = parent_elem.text
        parent_html["class"] = parent_elem.category
        parent_html["id"] = parent_elem.id
    
        result = parent.public_get_children_html(soup, parent_html)
    
    # ---- LARGE SCALE TEST CASES ----
    
    def test_large_number_of_children():
        # Test with 500 children
        num_children = 500
        children = [TestElementHtml(Element(text=f"Child{i}", category="cat", id=f"id{i}")) for i in range(num_children)]
        parent_elem = Element(text="Parent", category="pcat", id="pid")
        parent = TestElementHtml(parent_elem, children=children)
    
        soup = BeautifulSoup("", HTML_PARSER)
        parent_html = soup.new_tag("div")
        parent_html.string = parent_elem.text
        parent_html["class"] = parent_elem.category
        parent_html["id"] = parent_elem.id
    
        result = parent.public_get_children_html(soup, parent_html)
    
    def test_large_depth_of_nesting():
        # Test with 100 nested single-child levels
        depth = 100
        current = TestElementHtml(Element(text=f"Level{depth}", category="cat", id=f"id{depth}"))
        for i in range(depth-1, 0, -1):
            current = TestElementHtml(Element(text=f"Level{i}", category="cat", id=f"id{i}"), children=[current])
        parent = current
    
        soup = BeautifulSoup("", HTML_PARSER)
        parent_html = soup.new_tag("div")
        parent_html.string = parent.element.text
        parent_html["class"] = parent.element.category
        parent_html["id"] = parent.element.id
    
        result = parent.public_get_children_html(soup, parent_html)
        # Traverse down the nesting, checking text at each level
        node = result
        for i in range(1, depth+1):
            if len(node.contents) > 1:
                node = node.contents[1]
            else:
                break
    
    def test_large_tree_with_breadth_and_depth():
        # 10 children, each with 10 children (total 1 + 10 + 100 = 111 nodes)
        children = []
        for i in range(10):
            grandchildren = [TestElementHtml(Element(text=f"GC{i}-{j}", category="gcat", id=f"gid{i}-{j}")) for j in range(10)]
            child = TestElementHtml(Element(text=f"C{i}", category="ccat", id=f"cid{i}"), children=grandchildren)
            children.append(child)
        parent_elem = Element(text="P", category="pcat", id="pid")
        parent = TestElementHtml(parent_elem, children=children)
    
        soup = BeautifulSoup("", HTML_PARSER)
        parent_html = soup.new_tag("div")
        parent_html.string = parent_elem.text
        parent_html["class"] = parent_elem.category
        parent_html["id"] = parent_elem.id
    
        result = parent.public_get_children_html(soup, parent_html)
        for i, child_div in enumerate(result.contents[1:]):
            for j, gc_div in enumerate(child_div.contents[1:]):
                pass
    # codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
    ```
    
    </details>
    
    
    To edit these changes, run
    `git checkout codeflash/optimize-ElementHtml._get_children_html-mcsd67co`
    and push.
    
    
    
    ---------
    
    Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
    Co-authored-by: Saurabh Misra <[email protected]>
    3 people authored Sep 10, 2025
    ab55d86 View commit details

Commits on Sep 17, 2025

  1. Luke/sept16 CVE (#4094)

    Dependency bump and version bump, mainly to resolve the critical CVE in deepdiff.
    
    ---------
    
    Co-authored-by: cragwolfe <[email protected]>
    luke-kucing and cragwolfe authored Sep 17, 2025
    2d44d73 View commit details