Regex: Finds words that are repeated in multiple lines
Hello. I have these lines of regex expressions, each made of two parts separated by |, of this type:
(?s)((^.*)(<div class="entry-excerpt">)|(<!-- //.entry -->)(.*$))
(?s)((^.*)(<ul class="smallThumb-mainList">)|(<div class="navigation">)(.*$))
(?s)((^.*)(word_2)|(<!-- //.entry -->)(.*$))
(?s)((^.*)(word_2)|(<!-- //.ambro34 -->)(.*$))
I want to find all the words/regexes that are repeated before the | and those that are repeated after the |.
I tried a regex, but it doesn’t work very well:
Basically, after the search and replace, I want only one instance of each of these to remain:
(?s)((^.*)(word_2) because it is repeated 2 times before the | (on lines 3 and 4)
(<!-- //.entry -->)(.*$)) because it is repeated after the | (on lines 1 and 3)
Maybe, a simple example will be much better:
Word_1 | Word_2
Word_3 | Word_2
Word_4 | Word_5
Word_4 | Word_6
In this case, Word_4 and Word_2 are repeated. So, after the search, I want only these to remain.
As stated before here (https://notepad-plus-plus.org/community/topic/13248/regex-datetime) I think you’ve worn out everyone’s good nature (with the possible exception of @guy038) with your infinite regex questions. @MAPJe71 pointed out some good references for you to self-learn; that advice still holds. Sorry, but that’s the way I see it.
First of all, @alan-kilborn and @MapJe71, although I do understand your point of view and the advice that you give to @Vasile-Caraus, this present exercise seems, however, interesting. You may simply consider that it would allow you to know, in a two-column table, any text which is repeated, one or more times, in each column !
So @Vasile-Caraus, let’s go !
To begin with, some statements and hypotheses :
I’ll limit this topic to the general case of two parts of text, only, separated with one Vertical Line character ( Text_A|Text_B ), which, of course, matches the sub-problem of two regexes, separated by the alternation symbol ( | )
For syntaxes such as Text_A|Text_B|Text_C or more, it would be more expensive !! Well, set your mind at ease, I’m joking :-))
Of course, these two parts of text do NOT contain the Vertical Line character ( | ), themselves !
I chose the Commercial At sign ( @ ) as a temporary character. If your regexes may contain this character, just choose another symbol, which, preferably, won’t be a special regex symbol !
I’ll use the 12-line original text, below :
Text_0|Text_C
Text_1|Text_2
Text_4|Text_5
Text_3|Text_2
Text_4|Text_6
Text_7|Text_8
Text_9|Text_2
Text_4|Text_5
Text_7|Text_A
Text_0|Text_B
Text_2|Text_7
Text_6|Text_7
- Of course, the different NON-null strings Text_? can have any size !
Open a new tab
Copy/Paste the original text, above
Hit the Backspace key to suppress the possible End of Line character(s), of the last line ( Line 12 )
Open the Replace dialog
Run the first regex S/R, below :
It should produce the text :
Text_0@A-@|Text_C@B-@
Text_1@A-@|Text_2@B-@
Text_4@A-@|Text_5@B-@
Text_3@A-@|Text_2@B-@
Text_4@A-@|Text_6@B-@
Text_7@A-@|Text_8@B-@
Text_9@A-@|Text_2@B-@
Text_4@A-@|Text_5@B-@
Text_7@A-@|Text_A@B-@
Text_0@A-@|Text_B@B-@
Text_2@A-@|Text_7@B-@
Text_6@A-@|Text_7@B-@
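For readers who would rather script this tagging step, here is a minimal Python sketch. It is not the S/R from this post: the pattern and the function name tag_columns are my own assumptions, written for Python's re module, and it relies on the same hypotheses as above (exactly one | per line, and no @ in the text).

```python
import re

# Hypothetical helper: appends '@A-@' to the part before '|' and '@B-@'
# to the part after it, on every line of the given text.
LINE = re.compile(r"^([^|\n]+)\|([^|\n]+)$", re.MULTILINE)

def tag_columns(text):
    return LINE.sub(r"\1@A-@|\2@B-@", text)

sample = "Text_0|Text_C\nText_1|Text_2"
tagged = tag_columns(sample)
print(tagged)
```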
Now, choose Edit > Column Editor…, or hit the ALT + C shortcut
Select the zone Number to Insert
Choose 1, as Initial number
Choose 1, in the Increase by field
Select the Dec format of numbers
Place the caret, on the first line, between the strings @A- and @
Click on the OK button
=> A list of numbers, between 1 and 12, is inserted at caret position
Now, move the caret, on the first line, between the strings @B- and the last @
Re-open the Column Editor, with the ALT + C shortcut
Hit the Enter key
=> The same list of numbers is inserted, before the last @, of each line :
Text_0@A-1 @|Text_C@B-1 @
Text_1@A-2 @|Text_2@B-2 @
Text_4@A-3 @|Text_5@B-3 @
Text_3@A-4 @|Text_2@B-4 @
Text_4@A-5 @|Text_6@B-5 @
Text_7@A-6 @|Text_8@B-6 @
Text_9@A-7 @|Text_2@B-7 @
Text_4@A-8 @|Text_5@B-8 @
Text_7@A-9 @|Text_A@B-9 @
Text_0@A-10@|Text_B@B-10@
Text_2@A-11@|Text_7@B-11@
Text_6@A-12@|Text_7@B-12@
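The numbering done with the Column Editor can be approximated in a script. The sketch below is only an illustration under my own assumptions: number_tags is a hypothetical name, and the numbers are padded with trailing spaces to a common width, simply to mimic the output shown above.

```python
def number_tags(tagged_lines):
    """Insert the 1-based line number before the closing '@' of each tag."""
    # Pad every number to the width of the largest one, as in the text above.
    width = len(str(len(tagged_lines)))
    numbered = []
    for n, line in enumerate(tagged_lines, start=1):
        left, right = line.split("|", 1)      # 'Text_0@A-@', 'Text_C@B-@'
        num = str(n).ljust(width)
        # Drop the final '@' of each part, add the number, then restore '@'.
        numbered.append(f"{left[:-1]}{num}@|{right[:-1]}{num}@")
    return numbered

demo = [f"L{i}@A-@|R{i}@B-@" for i in range(10)]
result = number_tags(demo)
print(result[0], result[9])
```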
Then, with this second regex S/R :
we get the one-column list, below :
Text_0@A-1 @
Text_C@B-1 @
Text_1@A-2 @
Text_2@B-2 @
Text_4@A-3 @
Text_5@B-3 @
Text_3@A-4 @
Text_2@B-4 @
Text_4@A-5 @
Text_6@B-5 @
Text_7@A-6 @
Text_8@B-6 @
Text_9@A-7 @
Text_2@B-7 @
Text_4@A-8 @
Text_5@B-8 @
Text_7@A-9 @
Text_A@B-9 @
Text_0@A-10@
Text_B@B-10@
Text_2@A-11@
Text_7@B-11@
Text_6@A-12@
Text_7@B-12@
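As a rough script equivalent of this splitting step (not the second S/R itself), each line can be cut at the Vertical Line character so that every tagged text ends up on its own line. The name to_single_column is hypothetical.

```python
def to_single_column(numbered_lines):
    """Turn 'left|right' lines into a flat list: left, right, left, right..."""
    column = []
    for line in numbered_lines:
        left, right = line.split("|", 1)   # assumes exactly one '|' per line
        column.extend([left, right])
    return column

flat = to_single_column(["Text_0@A-1 @|Text_C@B-1 @", "Text_1@A-2 @|Text_2@B-2 @"])
print("\n".join(flat))
```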
Now, let’s use the menu option Edit > Line Operations > Sort lines Lexicographically Ascending
We obtain the sorted text, below :
Text_0@A-1 @
Text_0@A-10@
Text_1@A-2 @
Text_2@A-11@
Text_2@B-2 @
Text_2@B-4 @
Text_2@B-7 @
Text_3@A-4 @
Text_4@A-3 @
Text_4@A-5 @
Text_4@A-8 @
Text_5@B-3 @
Text_5@B-8 @
Text_6@A-12@
Text_6@B-5 @
Text_7@A-6 @
Text_7@A-9 @
Text_7@B-11@
Text_7@B-12@
Text_8@B-6 @
Text_9@A-7 @
Text_A@B-9 @
Text_B@B-10@
Text_C@B-1 @
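A small Python illustration of why the sort groups the entries this way: Python's sorted() compares strings by code point, so equal texts stay together, column A comes before column B for the same text, and the padding space sorts before any digit. This is a sketch of the idea only; I am not claiming anything about how the editor implements its own sort.

```python
entries = ["Text_2@B-2 @", "Text_0@A-10@", "Text_2@A-11@", "Text_0@A-1 @"]

# Plain code-point ordering: ' ' (0x20) < '0' (0x30) and 'A' < 'B'.
sorted_entries = sorted(entries)
print("\n".join(sorted_entries))
```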
The third regex S/R, below :
should delete any text which is unique in its column and keep, only, the different texts which occur several times in their column :
Text_0@A-1 @
Text_0@A-10@
Text_2@B-2 @
Text_2@B-4 @
Text_2@B-7 @
Text_4@A-3 @
Text_4@A-5 @
Text_4@A-8 @
Text_5@B-3 @
Text_5@B-8 @
Text_7@A-6 @
Text_7@A-9 @
Text_7@B-11@
Text_7@B-12@
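The same filtering can be sketched in Python by counting each (text, column) pair and keeping only the entries whose pair occurs more than once. This swaps the regex for collections.Counter, so it is a different technique from the third S/R; entry_key and keep_repeated are hypothetical names.

```python
from collections import Counter

def entry_key(entry):
    """Return (text, column letter) for an entry such as 'Text_2@B-7 @'."""
    text, tag = entry.split("@", 1)    # tag looks like 'B-7 @'
    return text, tag[0]

def keep_repeated(entries):
    """Keep only entries whose text occurs several times in the same column."""
    counts = Counter(entry_key(e) for e in entries)
    return [e for e in entries if counts[entry_key(e)] > 1]

kept = keep_repeated(["Text_1@A-2 @", "Text_2@A-11@", "Text_2@B-2 @", "Text_2@B-4 @"])
print(kept)
```

Note that Text_2 in column A is dropped here, because it occurs only once before the Vertical Line, even though it is repeated after it.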
Finally, use the fourth and last regex S/R, below :
You may replace any \x20 syntax with a single space character !
In the replacement regex, you may add some other spaces or replace the spaces by several tabulation characters
This S/R displays the different texts :
With the syntax Text_?| , if this text was located BEFORE the Vertical Line symbol
With the syntax |Text_? , if this text was located AFTER the Vertical Line symbol
The number ending each line is the number of the line where the string Text_? occurs, listed in increasing order, so that you can easily locate this string !
Text_0| 1
Text_0| 10
|Text_2 2
|Text_2 4
|Text_2 7
Text_4| 3
Text_4| 5
Text_4| 8
|Text_5 3
|Text_5 8
Text_7| 6
Text_7| 9
|Text_7 11
|Text_7 12
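For completeness, here is a hedged Python sketch of this final formatting, which turns a tagged entry back into the Text_?| or |Text_? form followed by its line number. format_entry is a hypothetical helper, and the single space before the number is my own choice, not necessarily the spacing the fourth S/R produces.

```python
def format_entry(entry):
    """'Text_4@A-3 @' -> 'Text_4| 3' ; 'Text_5@B-8 @' -> '|Text_5 8'."""
    text, tag = entry.split("@", 1)                  # 'Text_4', 'A-3 @'
    column, number = tag.rstrip("@").split("-", 1)   # 'A', '3 '
    number = number.strip()                          # drop the padding space
    if column == "A":
        return f"{text}| {number}"    # text was BEFORE the Vertical Line
    return f"|{text} {number}"        # text was AFTER the Vertical Line

for e in ["Text_4@A-3 @", "Text_5@B-8 @", "Text_7@B-12@"]:
    print(format_entry(e))
```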
If any of the four S/R, above, seems a bit tricky, just tell me about it !
I tested it and it WORKS. I think I will use macros for this long series of regexes.
Thanks, guy038. I believe you are my only friend around here. ;)