Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed #30686 -- Used Python HTMLParser in utils.text.Truncator #16421

Merged
merged 1 commit into from Feb 7, 2024

Conversation

smithdc1
Copy link
Member

@smithdc1 smithdc1 commented Jan 3, 2023

Marking as draft -- it's not finished but thought it may be helpful to share my thinking as the ticket is being reviewed. Even if the proposal isn't valid, maybe the extra tests are worth keeping.

https://code.djangoproject.com/ticket/30686

tests/utils_tests/test_text.py Show resolved Hide resolved
django/utils/text.py Outdated Show resolved Hide resolved
tests/utils_tests/test_text.py Show resolved Hide resolved
@smithdc1
Copy link
Member Author

smithdc1 commented Jan 8, 2023

We should bear in mind 7f65974

@smithdc1 smithdc1 force-pushed the ticket_30686 branch 2 times, most recently from 64e5739 to 9af6c3b Compare January 31, 2023 07:26
@smithdc1
Copy link
Member Author

Having checked out my branch / PR to django-asv (django/django-asv#70) and running asv continuous main 9af6c3b20a55d62e5c3a8b4425a4593a915332a6 -b utils_benchmarks I get:

       before           after         ratio
     [e1a093f8]       [9af6c3b2]
     <main>           <ticket_30686_2>
-     2.20±0.01ms         1.68±0ms     0.76  utils_benchmarks.truncator.benchmark.TruncatorBenchmark.time_chars_long
-      34.2±0.2ms       24.5±0.2ms     0.71  utils_benchmarks.truncator.benchmark.TruncatorBenchmark.time_chars_long_html
-        1.63±0ms      1.10±0.01ms     0.68  utils_benchmarks.truncator.benchmark.TruncatorBenchmark.time_chars_short
-      78.9±0.8ms         52.8±1ms     0.67  utils_benchmarks.truncator.benchmark.TruncatorBenchmark.time_words_long_html
-      22.4±0.3ms      9.90±0.08ms     0.44  utils_benchmarks.truncator.benchmark.TruncatorBenchmark.time_words_short_html

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@smithdc1 smithdc1 marked this pull request as ready for review February 8, 2023 07:52
@carltongibson
Copy link
Member

@matthiask you're the domain expert here. Can I ask you to (re-)review with a mind to are we doing it? — thanks! 🎁

@ngnpope
Copy link
Member

ngnpope commented Feb 12, 2023

I added some suggested changes in smithdc1#3.

@carltongibson
Copy link
Member

Some of these are not valid HTML so don't get through the HTML parser. e.g 50,000 &s. I'd need a bit of help to understand what exactly this is testing? (#16421 (comment))

We should bear in mind 7f65974

OK... so... what motivated the whole ticket here was a persistent trickle of security reports about ever more ingenious ways of making the regex misbehave. That problem is intractable, hence Let's use a parser. As far as I can see, adjusting the tests there to show that the input is (quickly) rejected would be sufficient. (We need to ensure that we don't open a DoS vector, not necessarily handle such deviant input.) //cc @django/django-security-team

@carltongibson
Copy link
Member

#16421 (comment) — Using the parser is quicker. I think we're green.

@smithdc1 if you want to look at @ngnpope's comments, and resolve, we should be able to push this forwards.

Thanks for effort! 🏅

(I will mark as PNI on the ticket for the moment.)

@matthiask
Copy link
Contributor

Sorry for not reacting sooner here. The speed up is a pleasant and unexpected surprise. I do not have any further comments apart from LGTM, thanks!

@claudep
Copy link
Member

claudep commented Mar 8, 2023

Looks great! May I ask about how invalid html is treated before and after the patch?

django/utils/text.py Outdated Show resolved Hide resolved
django/utils/text.py Outdated Show resolved Hide resolved
django/utils/text.py Outdated Show resolved Hide resolved
Copy link
Member

@ngnpope ngnpope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @smithdc1! Thanks for the update. I've had another thorough pass through this. Let me know if you want any help getting this progressed 🙂

django/utils/text.py Outdated Show resolved Hide resolved
django/utils/text.py Outdated Show resolved Hide resolved
django/utils/text.py Outdated Show resolved Hide resolved
django/utils/text.py Outdated Show resolved Hide resolved
django/utils/text.py Outdated Show resolved Hide resolved
tests/utils_tests/test_text.py Show resolved Hide resolved
tests/utils_tests/test_text.py Outdated Show resolved Hide resolved
tests/utils_tests/test_text.py Outdated Show resolved Hide resolved
tests/utils_tests/test_text.py Outdated Show resolved Hide resolved
@felixxm
Copy link
Member

felixxm commented Jul 14, 2023

Moved 2 commits to the #17071.

@felixxm
Copy link
Member

felixxm commented Jul 14, 2023

Please rebase.

@felixxm felixxm self-assigned this Jul 14, 2023
Copy link
Member

@ngnpope ngnpope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks David. Definitely feels like we're getting closer.

django/utils/text.py Outdated Show resolved Hide resolved
django/utils/text.py Outdated Show resolved Hide resolved
django/utils/text.py Outdated Show resolved Hide resolved
@@ -41,7 +41,7 @@ def test_truncate_unicode(self):
)

def test_truncate_something(self):
self.assertEqual(truncatechars_html("a<b>b</b>c", 3), "a<b>b</b>c")
self.assertEqual(truncatechars_html("a<b>b</b>c", 3), "a<b>b</b>")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so this is definitely a regression we're introducing here that we should fix.

I added the following test to tests/template_tests/filter_tests/test_truncatechars.py which passes fine:

    @setup({"truncatechars04": "{{ a|truncatechars:3 }}"})
    def test_truncatechars01(self):
        output = self.engine.render_to_string(
            "truncatechars04", {"a": "abc"}
        )
        self.assertEqual(output, "abc")

I think we should add this as it covers the exact length case which is missing there.

The HTML chars case is regressing. The problem is on this line:

https://github.com/django/django/pull/16421/files#diff-a6acaa0a744b3a7c841d9ab3ccbc1765f5749e3782fced61edced456c227b43fR188

We are passing in length=truncate_len when we should be passing in length=length, as for the words case. When you look at the text chars case we are passing through both length and truncate_len.

I suspect that we will want to hoist the following into a global helper:

https://github.com/django/django/pull/16421/files#diff-a6acaa0a744b3a7c841d9ab3ccbc1765f5749e3782fced61edced456c227b43fR180-R186

Which we can then call independently from the appropriate place in the text branch or the HTML branch (inside of TruncateCharsHTMLParser).

@nessita
Copy link
Contributor

nessita commented Sep 6, 2023

@smithdc1 Hey! Is this ready for re-review? If yes perhaps we should unset the needs improvement flag in the ticket.

Copy link
Contributor

@nessita nessita left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial comments, looks good!

docs/releases/5.0.txt Outdated Show resolved Hide resolved
docs/releases/5.0.txt Outdated Show resolved Hide resolved
docs/releases/5.0.txt Outdated Show resolved Hide resolved
tests/utils_tests/test_text.py Show resolved Hide resolved
@felixxm
Copy link
Member

felixxm commented Sep 7, 2023

@smithdc1 Hey! Is this ready for re-review? If yes perhaps we should unset the needs improvement flag in the ticket.

There are unresolved comments, e.g. #16421 (comment) or #16421 (comment).

@nessita
Copy link
Contributor

nessita commented Sep 7, 2023

There are unresolved comments, e.g. #16421 (comment) or #16421 (comment).

I believe the first one is already addressed in 4f1791c

@nessita
Copy link
Contributor

nessita commented Jan 15, 2024

@smithdc1 Hi! Happy New Year 🎆

Would you have time to keep working on this? If not I might try to push it to the finish line. Let me know! Thanks.

@smithdc1
Copy link
Member Author

Hi @nessita -- happy new year. Hope you had a good break! 🎆

Yes. Let me update this one. I'll have to remind myself of where I got to!

@felixxm
Copy link
Member

felixxm commented Feb 7, 2024

@smithdc1 Thanks for all your efforts 👍

@ngnpope
Copy link
Member

ngnpope commented Feb 7, 2024

Yes, thanks @smithdc1. This is a solid improvement. Also thanks for enduring my tweaks/reviews!

@felixxm felixxm merged commit 6ee37ad into django:main Feb 7, 2024
35 checks passed
@smithdc1 smithdc1 deleted the ticket_30686 branch February 7, 2024 10:59
@pauloxnet
Copy link
Contributor

Great improvement @smithdc1 👏🏻

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
10 participants