Fixed #30686 -- Used Python HTMLParser in utils.text.Truncator #16421

smithdc1 · 2023-01-03T20:52:27Z

Marking as draft -- it's not finished but thought it may be helpful to share my thinking as the ticket is being reviewed. Even if the proposal isn't valid, maybe the extra tests are worth keeping.

https://code.djangoproject.com/ticket/30686

smithdc1 · 2023-01-08T13:03:23Z

We should bear in mind 7f65974

smithdc1 · 2023-01-31T07:31:29Z

Having checked out my django-asv branch (django/django-asv#70) and run asv continuous main 9af6c3b20a55d62e5c3a8b4425a4593a915332a6 -b utils_benchmarks, I get:

       before           after         ratio
     [e1a093f8]       [9af6c3b2]
     <main>           <ticket_30686_2>
-     2.20±0.01ms         1.68±0ms     0.76  utils_benchmarks.truncator.benchmark.TruncatorBenchmark.time_chars_long
-      34.2±0.2ms       24.5±0.2ms     0.71  utils_benchmarks.truncator.benchmark.TruncatorBenchmark.time_chars_long_html
-        1.63±0ms      1.10±0.01ms     0.68  utils_benchmarks.truncator.benchmark.TruncatorBenchmark.time_chars_short
-      78.9±0.8ms         52.8±1ms     0.67  utils_benchmarks.truncator.benchmark.TruncatorBenchmark.time_words_long_html
-      22.4±0.3ms      9.90±0.08ms     0.44  utils_benchmarks.truncator.benchmark.TruncatorBenchmark.time_words_short_html

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

carltongibson · 2023-02-08T08:22:39Z

@matthiask you're the domain expert here. Can I ask you to (re-)review with a view to deciding whether we're going ahead with this? Thanks! 🎁

ngnpope · 2023-02-12T21:07:38Z

I added some suggested changes in smithdc1#3.

carltongibson · 2023-03-08T10:49:26Z

> Some of these are not valid HTML so don't get through the HTML parser, e.g. 50,000 &s. I'd need a bit of help to understand what exactly this is testing? (#16421 (comment))

> We should bear in mind 7f65974

OK... so... what motivated the whole ticket here was a persistent trickle of security reports about ever more ingenious ways of making the regex misbehave. That problem is intractable, hence "let's use a parser". As far as I can see, adjusting the tests there to show that the input is (quickly) rejected would be sufficient. (We need to ensure that we don't open a DoS vector, not necessarily handle such deviant input.) //cc @django/django-security-team
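To make the "use a parser" idea concrete, here is a minimal sketch of word-based truncation built on the standard library's html.parser.HTMLParser. It is not the code from this PR: the names WordTruncatingParser, truncate_html_words and _Done are hypothetical, the void-element set is written out by hand, and comments and declarations are simply dropped. The point is only that the input is consumed in one forward pass, with no backtracking regex involved.

```python
import re
from html import escape
from html.parser import HTMLParser

# Elements that never take an end tag (hand-written list for this sketch).
VOID_ELEMENTS = frozenset(
    {"area", "base", "br", "col", "embed", "hr", "img", "input",
     "link", "meta", "source", "track", "wbr"}
)


class _Done(Exception):
    """Raised to stop parsing once the word budget is spent."""


class WordTruncatingParser(HTMLParser):
    """Hypothetical helper: copies markup through until max_words is hit."""

    def __init__(self, max_words):
        super().__init__(convert_charrefs=True)
        self.remaining = max_words
        self.output = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Re-emit the start tag exactly as it appeared in the input.
        self.output.append(self.get_starttag_text())
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray or void end tag: ignore it
        # Close anything left open inside the element being closed.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.output.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        # Split into alternating runs of whitespace and non-whitespace.
        for piece in re.findall(r"\s+|\S+", data):
            if not piece.isspace():
                if self.remaining == 0:
                    raise _Done
                self.remaining -= 1
            # Data arrives unescaped, so escape it again on the way out.
            self.output.append(escape(piece, quote=False))


def truncate_html_words(markup, max_words, ellipsis="…"):
    parser = WordTruncatingParser(max_words)
    try:
        parser.feed(markup)
        parser.close()
    except _Done:
        # Drop whitespace collected just before the cut.
        while parser.output and parser.output[-1].isspace():
            parser.output.pop()
        parser.output.append(ellipsis)
        # Close whatever is still open so the result stays balanced.
        parser.output.extend(f"</{tag}>" for tag in reversed(parser.open_tags))
    return "".join(parser.output)


print(truncate_html_words("<p>one <b>two</b> three</p>", 2))
# <p>one <b>two</b>…</p>
```

Stopping via an exception keeps the handlers simple: once the budget is exhausted, nothing after the cut is examined at all.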

carltongibson · 2023-03-08T10:51:29Z

#16421 (comment) — Using the parser is quicker. I think we're green.

@smithdc1 if you want to look at @ngnpope's comments, and resolve, we should be able to push this forwards.

Thanks for the effort! 🏅

(I will mark as PNI on the ticket for the moment.)

matthiask · 2023-03-08T10:53:56Z

Sorry for not reacting sooner here. The speed-up is a pleasant and unexpected surprise. I do not have any further comments apart from LGTM, thanks!

claudep · 2023-03-08T12:01:31Z

Looks great! May I ask how invalid HTML is treated before and after the patch?

ngnpope

Hi @smithdc1! Thanks for the update. I've had another thorough pass through this. Let me know if you want any help getting this progressed 🙂

felixxm · 2023-07-14T08:27:47Z

Moved 2 commits to the #17071.

felixxm · 2023-07-14T09:27:01Z

Please rebase.

ngnpope

Thanks David. Definitely feels like we're getting closer.

ngnpope · 2023-07-14T15:22:07Z

tests/template_tests/filter_tests/test_truncatechars_html.py

@@ -41,7 +41,7 @@ def test_truncate_unicode(self):
        )

    def test_truncate_something(self):
-        self.assertEqual(truncatechars_html("a<b>b</b>c", 3), "a<b>b</b>c")
+        self.assertEqual(truncatechars_html("a<b>b</b>c", 3), "a<b>b</b>…")


Ok, so this is definitely a regression we're introducing here that we should fix.

I added the following test to tests/template_tests/filter_tests/test_truncatechars.py which passes fine:

    @setup({"truncatechars04": "{{ a|truncatechars:3 }}"})
    def test_truncatechars04(self):
        output = self.engine.render_to_string("truncatechars04", {"a": "abc"})
        self.assertEqual(output, "abc")

I think we should add this as it covers the exact length case which is missing there.

The HTML chars case is regressing. The problem is on this line:

https://github.com/django/django/pull/16421/files#diff-a6acaa0a744b3a7c841d9ab3ccbc1765f5749e3782fced61edced456c227b43fR188

We are passing in length=truncate_len when we should be passing in length=length, as for the words case. When you look at the text chars case we are passing through both length and truncate_len.

I suspect that we will want to hoist the following into a global helper:

https://github.com/django/django/pull/16421/files#diff-a6acaa0a744b3a7c841d9ab3ccbc1765f5749e3782fced61edced456c227b43fR180-R186

Which we can then call independently from the appropriate place in the text branch or the HTML branch (inside of TruncateCharsHTMLParser).
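The length versus truncate_len mix-up described in this comment can be illustrated with a small standalone sketch. This is not Django's implementation: truncate_chars and truncate_chars_buggy are hypothetical names, tags and combining characters are ignored, and the ellipsis is assumed to be a single character. Here length decides whether to truncate at all, while truncate_len (length minus the room needed for the ellipsis) decides where to cut.

```python
def truncate_chars(text, length, ellipsis="…"):
    """Simplified sketch: truncate only when text is longer than length."""
    # Characters we can keep once the ellipsis has been accounted for.
    truncate_len = max(length - len(ellipsis), 0)
    if len(text) <= length:
        # Text that fits exactly is returned untouched.
        return text
    return text[:truncate_len] + ellipsis


def truncate_chars_buggy(text, length, ellipsis="…"):
    """Same sketch with the mistake: truncate_len used as the limit."""
    truncate_len = max(length - len(ellipsis), 0)
    if len(text) <= truncate_len:  # wrong threshold
        return text
    return text[:truncate_len] + ellipsis


print(truncate_chars("abc", 3))        # abc
print(truncate_chars_buggy("abc", 3))  # ab…
print(truncate_chars("abcd", 3))       # ab…
```

The buggy variant reproduces the shape of the regression in the diff above: input whose text is exactly the requested length loses its last character to an ellipsis it never needed.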

nessita · 2023-09-06T13:16:43Z

@smithdc1 Hey! Is this ready for re-review? If yes perhaps we should unset the needs improvement flag in the ticket.

nessita

Initial comments, looks good!

felixxm · 2023-09-07T10:19:12Z

> @smithdc1 Hey! Is this ready for re-review? If yes perhaps we should unset the needs improvement flag in the ticket.

There are unresolved comments, e.g. #16421 (comment) or #16421 (comment).

nessita · 2023-09-07T11:44:31Z

> There are unresolved comments, e.g. #16421 (comment) or #16421 (comment).

I believe the first one is already addressed in 4f1791c

nessita · 2024-01-15T19:44:04Z

@smithdc1 Hi! Happy New Year 🎆

Would you have time to keep working on this? If not I might try to push it to the finish line. Let me know! Thanks.

smithdc1 · 2024-01-15T19:54:39Z

Hi @nessita -- happy new year. Hope you had a good break! 🎆

Yes. Let me update this one. I'll have to remind myself of where I got to!

felixxm · 2024-02-07T08:54:40Z

@smithdc1 Thanks for all your efforts 👍

ngnpope · 2024-02-07T09:08:46Z

Yes, thanks @smithdc1. This is a solid improvement. Also thanks for enduring my tweaks/reviews!

pauloxnet · 2024-02-21T11:44:54Z

Great improvement @smithdc1 👏🏻

felixxm reviewed Jan 4, 2023

smithdc1 force-pushed the ticket_30686 branch 2 times, most recently from 64e5739 to 9af6c3b Compare January 31, 2023 07:26

smithdc1 marked this pull request as ready for review February 8, 2023 07:52

carltongibson requested a review from matthiask February 8, 2023 08:21

ngnpope mentioned this pull request Feb 12, 2023

Improvements to truncating parser... smithdc1/django#3

Merged

felixxm reviewed Mar 8, 2023

adamchainz reviewed Mar 9, 2023

smithdc1 force-pushed the ticket_30686 branch from 84a43ad to 53aeccf Compare June 19, 2023 06:39

ngnpope suggested changes Jun 22, 2023

smithdc1 mentioned this pull request Jun 22, 2023

Corrected Spanish spelling in Truncator test. #16997

Closed

smithdc1 force-pushed the ticket_30686 branch from a0a2a82 to 59114e2 Compare June 23, 2023 05:37

smithdc1 force-pushed the ticket_30686 branch 2 times, most recently from c4f6095 to 4c09d1c Compare July 8, 2023 08:01

felixxm self-assigned this Jul 14, 2023

ngnpope suggested changes Jul 14, 2023

smithdc1 force-pushed the ticket_30686 branch from 4c09d1c to ae7cfb9 Compare July 21, 2023 08:36

smithdc1 force-pushed the ticket_30686 branch from ae7cfb9 to 4f1791c Compare August 18, 2023 13:49

nessita reviewed Sep 6, 2023

smithdc1 force-pushed the ticket_30686 branch from 4f1791c to b017059 Compare January 20, 2024 16:22

smithdc1 force-pushed the ticket_30686 branch 2 times, most recently from 5d6beea to b3a575c Compare February 3, 2024 11:02

felixxm force-pushed the ticket_30686 branch from b3a575c to e6b12b9 Compare February 6, 2024 19:41

felixxm reviewed Feb 6, 2024

felixxm force-pushed the ticket_30686 branch from e6b12b9 to e222f8f Compare February 7, 2024 07:28

Fixed #30686 -- Used Python HTMLParser in utils.text.Truncator.

6ee37ad

felixxm force-pushed the ticket_30686 branch from e222f8f to 6ee37ad Compare February 7, 2024 08:46

felixxm merged commit 6ee37ad into django:main Feb 7, 2024
35 checks passed

smithdc1 deleted the ticket_30686 branch February 7, 2024 10:59

ngnpope mentioned this pull request Mar 13, 2024

Refs #30686 -- Made django.utils.html.VOID_ELEMENTS a frozenset. #17971

Merged
