# Regex: select/match the numbers that are repeated most often

• hello, I have 15 rows with 7 numbers, from 1 to 50. How can I match the 4 numbers that are repeated most often in all those 15 rows?

I suppose I must first select all numbers `\d+`
then, I have to separate all 2-digit numbers `\b[1-9]{2}\b` from all 1-digit numbers `\b[1-9]{1}\b`
or, I should select all numbers from `1-10`, then all numbers from `10-20` …and from `40-50`

I don’t know exactly, there should be a mathematics formula. In Excel I can use filters for this, or Sort from lowest to highest, etc

But how can I do this with regex?

• @Vasile-Caraus
It isn’t entirely clear what you mean by “match”. Do you simply want to know which 4 numbers appear most often or something else altogether?

If you know how to do this in Excel, then if I were you, I would import the document into Excel and do whatever it is you want done.

Otherwise, you should probably use a scripting language (AWK, PERL, Python, etc.). This doesn’t sound like a task that regex is best suited to do.

• posted

Hello, Vasile,

I tried to guess, first, what you wanted to achieve and, after getting random numbers from the Net, I spent some hours, from time to time, imagining a method ! And, luckily, I succeeded in finding a solution, using test data from the `Random.org` site. The solution lets you obtain the most frequent integers in a table of 10,000 integers maximum, with values between 1 and 9999 maximum

On the `Random.org` site, the value of the random numbers can be in the range ±1,000,000,000 but, because of the regexes needed, I preferred to limit this range to between 1 and 9999

As your table of numbers contains 15 rows of 7 columns, the total number of integers, with value between 1 and 50 , is 105

So, go, first, to the `Random.org` site, from the address, below :

https://www.random.org/integers/

I typed ( in `red` colour ) the following answers :

• Generate `105` random integers (maximum 10,000).

• Each integer should have a value between `1` and `50` (both inclusive; limits ±1,000,000,000).

• Format in `7` column(s).

• Note: The numbers are generated left to right !

And I clicked on the Get Numbers button

I got a 15 x 7 table of 105 random integers, below, that I copied/pasted in a new tab, in N++

``````
2   27  7   11  32  6   7
8   45  50  19  37  40  47
21  11  50  46  50  27  49
41  13  36  3   37  29  23
25  22  47  3   37  2   29
8   48  29  46  24  18  9
46  8   24  19  5   22  27
29  26  44  47  22  22  5
22  25  35  47  48  24  3
10  20  28  49  7   24  3
37  27  4   40  44  45  14
4   44  15  43  46  32  7
47  15  11  17  16  42  8
28  44  43  24  17  8   5
32  27  11  1   35  28  29
``````

In that output list, the integers are separated by a single tab character. As I intended to sort these values, I first needed to put all the values in a one-column table.

Moreover, it was necessary to use a template, with possible leading zeros, in order to sort, later, these integers, correctly ! So :

• A one-digit integer, was changed into the integer `000#`
• A two-digits integer, was changed into the integer `00##`
• A three-digits integer, was changed into the integer `0###`
• A four-digits integer, was changed into the integer `####`

The regex S/R which achieves these two goals is :

SEARCH `^(\d(\d(\d(\d)?)?)?)(?:\t|\R)`

REPLACE `(?2:0)(?3:0)(?4:0)\1\r\n`

After clicking, ONCE, on the Replace All button, I got the list, of 105 integers, below :

``````
0002
0027
0007
0011
0032
0006
0007
0008
0045
0050
0019
0037
0040
0047
0021
0011
0050
0046
0050
0027
0049
0041
0013
0036
0003
0037
0029
0023
0025
0022
0047
0003
0037
0002
0029
0008
0048
0029
0046
0024
0018
0009
0046
0008
0024
0019
0005
0022
0027
0029
0026
0044
0047
0022
0022
0005
0022
0025
0035
0047
0048
0024
0003
0010
0020
0028
0049
0007
0024
0003
0037
0027
0004
0040
0044
0045
0014
0004
0044
0015
0043
0046
0032
0007
0047
0015
0011
0017
0016
0042
0008
0028
0044
0043
0024
0017
0008
0005
0032
0027
0011
0001
0035
0028
0029
``````
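
For readers who prefer a script, the first S/R can be approximated in plain Python 3, outside Notepad++. This is only a sketch: the function name `pad_and_split` is made up for the illustration, Python's `re` module has no `\R` so `\r?\n` is used instead, and the conditional replacement `(?2:0)(?3:0)(?4:0)` is swapped for `str.zfill`. Like the regex above, it assumes the numbers are separated by tabs and have at most four digits.

```python
import re

def pad_and_split(text, width=4):
    """Mimic the first S/R: one number per line, zero-padded to `width` digits."""
    # A run of 1 to 4 digits followed by a tab or a line break...
    pattern = re.compile(r"\b(\d{1,4})(?:\t|\r?\n)")
    # ...is rewritten as the padded number followed by a Windows line break.
    return pattern.sub(lambda m: m.group(1).zfill(width) + "\r\n", text)

# Two short tab-separated rows, taken from the start of the table above
sample = "2\t27\t7\r\n8\t45\t50\r\n"
result = pad_and_split(sample)
print(result)
```

If the pasted numbers are separated by spaces instead of tabs, neither this pattern nor the S/R above will match them; the separator alternative would have to be extended accordingly.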

Using the menu option Edit > Line Operations > Sort Lines Lexicographically Ascending, I obtained the sorted text, below :

``````
0001
0002
0002
0003
0003
0003
0003
0004
0004
0005
0005
0005
0006
0007
0007
0007
0007
0008
0008
0008
0008
0008
0009
0010
0011
0011
0011
0011
0013
0014
0015
0015
0016
0017
0017
0018
0019
0019
0020
0021
0022
0022
0022
0022
0022
0023
0024
0024
0024
0024
0024
0025
0025
0026
0027
0027
0027
0027
0027
0028
0028
0028
0029
0029
0029
0029
0029
0032
0032
0032
0035
0035
0036
0037
0037
0037
0037
0040
0040
0041
0042
0043
0043
0044
0044
0044
0044
0045
0045
0046
0046
0046
0046
0047
0047
0047
0047
0047
0048
0048
0049
0049
0050
0050
0050
``````
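
The reason for the leading zeros can be shown with a few values from the table. The sketch below (plain Python 3, variable names chosen for the illustration) compares a text sort of the raw numbers with a text sort of the padded ones.

```python
values = ["2", "27", "7", "11", "32", "6"]

# Text order compares character by character, so "11" comes before "2"
plain_sort = sorted(values)

# With a fixed width of 4 digits, text order and numeric order agree
padded_sort = sorted(v.zfill(4) for v in values)

print(plain_sort)
print(padded_sort)
```

Only the padded list comes out in numeric order, which is what the later S/R steps rely on to find equal numbers on consecutive lines.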

Then, I found a regex to put all identical numbers on a single line. For instance, the four numbers 0003, on four consecutive lines, were displayed, after replacement, on the single line 0003 0003 0003 0003. So :

SEARCH `(\d{4})\R\1`

REPLACE `\1 \1` , with a space character, between the two back-references, `\1`

IMPORTANT : You must click, TWICE, on the Replace All button, in order to end this S/R

REMARK :

• If each number occurs ONCE or TWICE, only, in the current random list, you may, already, get the message : Replace All: 0 occurrences were replaced, while clicking a second time, on the Replace All button !

Thus, after TWO clicks on the Replace All button, that list was changed into this new one, below :

``````
0001
0002 0002
0003 0003 0003 0003
0004 0004
0005 0005 0005
0006
0007 0007 0007 0007
0008 0008 0008 0008 0008
0009
0010
0011 0011 0011 0011
0013
0014
0015 0015
0016
0017 0017
0018
0019 0019
0020
0021
0022 0022 0022 0022 0022
0023
0024 0024 0024 0024 0024
0025 0025
0026
0027 0027 0027 0027 0027
0028 0028 0028
0029 0029 0029 0029 0029
0032 0032 0032
0035 0035
0036
0037 0037 0037 0037
0040 0040
0041
0042
0043 0043
0044 0044 0044 0044
0045 0045
0046 0046 0046 0046
0047 0047 0047 0047 0047
0048 0048
0049 0049
0050 0050 0050
``````
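
The grouping step can be tried outside Notepad++ as well. The following Python 3 sketch applies the same pattern twice, like the two clicks on Replace All. The function name `group_equal_lines` is hypothetical, and `\r?\n` stands in for `\R`, which Python's `re` module does not support.

```python
import re

def group_equal_lines(text):
    """Join consecutive lines holding the same 4-digit number, separated by spaces."""
    pattern = re.compile(r"(\d{4})\r?\n\1")
    for _ in range(2):                    # two passes, like two clicks on Replace All
        text = pattern.sub(r"\1 \1", text)
    return text

sorted_lines = "0007\n0008\n0008\n0008\n0008\n0008\n0009\n"
grouped = group_equal_lines(sorted_lines)
print(grouped)
```

After the first pass the five 0008 lines become `0008 0008`, `0008 0008` and `0008`; the second pass joins them into one line.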

Finally, I had to get rid of all the numbers, which were present, less than four times ! Indeed, only the integers, repeated, at least, four times, in that list, seemed useful. The suitable S/R to do so, is :

SEARCH `^(?!(\d{4})( \1){3}).+\R`

REPLACE (leave this field empty)

NOTE :

• The general regex `^(?!(\d{4})( \1){N}).+\R` deletes all the lines where the current number is present between 1 and N times, maximum. So :

• If N = 1, every number, present ONCE, in the list, will be deleted
• If N = 2, every number, present ONCE or TWICE, in the list, will be deleted
• If N = 3, every number, present ONCE, TWICE or THREE times, in the list, will be deleted
• If N = 4, every number, present, between ONCE and FOUR times, in the list, will be deleted
• And so on…
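
The effect of N can be checked with a short Python 3 sketch. The function name `keep_frequent` is invented for the illustration, `\r?\n` replaces `\R`, and the `re.MULTILINE` flag is needed so that `^` matches at the start of every line, as it does in Notepad++.

```python
import re

def keep_frequent(grouped_text, n):
    """Delete every line whose number is present n times or fewer."""
    pattern = re.compile(r"^(?!(\d{4})( \1){%d}).+\r?\n" % n, re.MULTILINE)
    return pattern.sub("", grouped_text)

grouped = (
    "0001\n"
    "0003 0003 0003 0003\n"
    "0005 0005 0005\n"
    "0008 0008 0008 0008 0008\n"
)
kept = keep_frequent(grouped, 3)
print(kept)
```

With N = 3, the lines for 0001 (once) and 0005 (three times) are removed, and the lines for 0003 and 0008 remain.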

After clicking ONCE, on the Replace All button, I got the final text, below :

``````
0003 0003 0003 0003
0007 0007 0007 0007
0008 0008 0008 0008 0008
0011 0011 0011 0011
0022 0022 0022 0022 0022
0024 0024 0024 0024 0024
0027 0027 0027 0027 0027
0029 0029 0029 0029 0029
0037 0037 0037 0037
0044 0044 0044 0044
0046 0046 0046 0046
0047 0047 0047 0047 0047
``````

Finally, from this text, it’s easy to deduce that the most frequent numbers in that random list of 105 numbers are the six integers 8, 22, 24, 27, 29 and 47, which are each present five times :-))

A second example :

I will not give details about it. I’ll just give the original random list of integers and the final list of the most frequent integers found

Let’s suppose a list of 300 integers, with values from 1 to 150, placed in 15 rows of 20 columns, each, below :

``````
56  142 24  68  122 132 35  127 56  29  119 97  3   143 21  72  138 109 18  124
51  42  144 5   100 39  60  12  101 94  16  118 108 61  29  125 150 67  60  57
22  82  148 9   29  111 138 123 108 130 47  1   141 75  107 124 58  24  47  46
121 78  107 51  92  21  114 75  105 62  114 7   89  77  63  39  21  131 126 107
50  13  85  26  33  103 112 74  122 62  11  86  22  90  53  143 74  122 26  109
96  128 148 85  3   18  88  132 90  86  150 118 80  20  41  147 91  6   3   45
143 139 145 52  150 111 132 73  86  30  125 28  66  24  61  41  76  108 16  51
138 78  50  52  125 88  11  145 13  25  111 15  103 124 94  2   1   80  74  6
58  14  78  6   27  39  75  117 69  98  53  1   71  11  60  15  21  115 129 2
10  147 8   45  20  90  41  29  3   101 44  116 52  39  141 132 102 33  57  110
21  43  16  33  51  59  78  116 116 23  50  18  114 106 8   93  96  25  6   71
6   31  58  49  114 91  17  9   30  99  113 137 16  131 29  102 40  133 34  147
98  7   81  127 136 132 126 69  48  5   54  128 94  85  11  134 71  92  108 37
54  121 118 65  124 58  122 130 67  77  26  65  136 14  149 146 117 54  60  20
147 103 28  129 32  94  139 111 122 74  146 86  83  100 75  100 48  48  99  112
``````

At the end, after the third regex S/R , you should get the final text, below :

``````
0003 0003 0003 0003
0006 0006 0006 0006 0006
0011 0011 0011 0011
0016 0016 0016 0016
0021 0021 0021 0021 0021
0029 0029 0029 0029 0029
0039 0039 0039 0039
0051 0051 0051 0051
0058 0058 0058 0058
0060 0060 0060 0060
0074 0074 0074 0074
0075 0075 0075 0075
0078 0078 0078 0078
0086 0086 0086 0086
0094 0094 0094 0094
0108 0108 0108 0108
0111 0111 0111 0111
0114 0114 0114 0114
0122 0122 0122 0122 0122
0124 0124 0124 0124
0132 0132 0132 0132 0132
0147 0147 0147 0147
``````

Now, it is not difficult to see that the most frequent numbers in that random list of 300 numbers, between 1 and 150, are the five integers 6, 21, 29, 122 and 132, which are each present five times :-))

A third example ( without explanations, just try ! )

Let’s suppose a list of 100 integers, with values from 1 to 999, placed in 10 rows of 10 columns, each, below :

``````
591 132 551 647 337 570 610 427 281 868
266 424 760 306 46  262 239 178 11  752
236 97  50  415 237 198 444 63  77  602
189 562 36  334 822 704 759 242 651 306
39  998 172 606 973 846 854 687 759 304
865 50  5   583 685 888 510 468 742 144
612 948 538 802 531 657 300 779 817 392
227 231 984 466 670 203 852 879 164 775
362 211 981 675 889 273 86  184 485 643
180 390 690 292 906 902 245 933 679 931
``````

The last S/R is not even needed, because most of the numbers are present ONCE only !

=> The most frequent numbers, in that random list of 100 numbers, between 1 and 999, are the three integers 50, 306 and 759, which are present two times !

A final example :

Let’s suppose a list of 1000 integers, with values from 1 to 30, placed in 50 rows of 20 columns, each, below :

``````
14  3   10  12  28  16  19  10  3   25  2   14  8   8   27  8   1   20  27  13
25  30  5   13  25  8   9   29  4   7   19  7   13  18  18  23  25  8   15  4
7   17  15  27  17  1   19  12  5   22  7   18  2   20  11  6   22  26  2   20
22  20  8   27  26  26  6   29  19  22  17  12  22  7   27  1   16  24  3   29
26  7   9   16  2   8   3   11  5   17  4   20  2   5   16  11  17  7   2   1
15  20  11  11  5   11  18  24  3   10  2   30  29  23  17  21  14  12  5   11
27  10  16  2   15  22  26  8   12  21  18  16  4   2   5   27  18  28  17  3
10  2   27  4   20  19  14  11  18  16  29  2   11  7   1   29  29  6   18  26
26  10  30  21  6   10  7   6   30  27  2   5   25  25  22  24  17  8   16  21
13  27  16  19  16  21  28  23  30  24  12  24  5   30  14  5   21  2   22  11
20  2   19  21  29  23  21  8   21  15  26  22  28  22  13  27  1   6   14  7
11  20  3   17  9   4   9   5   7   18  21  20  11  14  21  22  6   29  22  21
21  25  7   20  28  18  1   30  4   25  28  10  24  23  8   9   17  24  6   11
21  10  28  24  1   24  29  8   7   28  1   14  10  23  14  12  28  30  21  11
13  11  3   18  30  15  2   13  29  14  22  17  30  16  17  9   24  8   11  23
29  7   21  3   25  23  17  28  25  30  26  19  25  29  6   15  20  9   30  17
23  26  30  16  5   21  22  13  24  24  16  27  24  5   1   28  25  26  21  11
9   5   3   23  19  3   7   30  3   9   25  29  12  3   14  19  23  25  26  20
6   9   14  15  12  27  2   2   27  28  23  25  13  1   13  16  24  10  28  6
5   8   5   6   24  20  22  15  9   6   19  26  27  15  15  21  12  24  27  9
22  5   18  18  23  25  20  7   9   7   21  21  24  19  21  1   7   14  20  8
5   7   23  3   26  10  8   27  26  3   5   2   27  15  29  2   28  18  5   19
19  18  14  26  15  23  2   18  4   7   5   30  5   9   8   17  27  2   24  21
21  27  11  25  20  5   28  4   26  3   9   13  4   22  26  4   30  9   13  14
24  29  11  6   26  20  30  1   2   11  2   7   20  10  3   26  4   3   4   27
26  30  4   9   13  9   15  28  23  1   10  1   3   30  27  29  4   28  11  8
3   1   27  23  30  30  6   14  15  28  7   29  24  8   23  8   4   15  24  10
17  18  27  19  17  29  25  7   5   8   21  22  24  8   15  16  10  29  7   12
1   18  19  3   22  1   13  16  26  27  4   3   16  30  7   13  14  8   28  4
17  10  8   11  6   8   13  13  27  19  14  21  28  26  26  20  26  5   30  14
22  23  9   28  11  21  12  3   11  7   26  16  14  4   20  24  15  12  13  4
12  24  8   9   25  1   29  5   24  24  13  1   5   26  14  19  12  27  19  17
12  14  7   6   3   26  24  11  19  1   1   2   3   13  19  8   18  14  3   13
29  25  14  30  12  22  14  14  20  12  2   2   13  26  7   28  12  26  2   13
13  23  22  6   11  1   25  23  12  18  24  1   10  17  23  4   28  14  6   13
27  7   25  2   25  27  12  14  10  7   8   9   19  1   19  14  10  29  17  5
9   8   30  12  25  16  3   14  26  30  7   27  2   15  3   28  4   11  6   2
28  13  3   14  15  18  22  11  18  30  19  6   24  30  22  14  8   29  2   13
27  2   1   8   23  24  5   1   1   24  23  17  6   25  17  2   16  26  19  13
18  22  21  27  10  13  7   27  4   8   30  15  11  3   27  26  22  22  5   17
14  28  27  14  11  2   14  8   26  4   2   28  4   25  29  10  16  23  6   10
21  23  4   19  25  13  4   26  8   3   27  2   19  2   30  8   25  1   1   4
8   15  19  19  25  4   7   7   21  13  24  21  26  13  14  22  6   9   10  26
7   29  25  17  11  4   8   30  26  6   5   8   23  16  13  23  17  2   21  4
24  4   13  25  12  12  13  16  19  11  19  11  30  6   19  7   12  10  18  14
1   7   20  19  28  1   28  6   7   9   21  7   11  9   10  7   1   16  27  20
27  16  30  21  23  25  25  5   22  13  15  27  26  22  4   28  13  25  18  29
7   5   25  19  28  19  20  18  10  1   30  24  13  13  29  16  8   8   15  25
7   20  12  18  9   9   17  13  19  18  29  9   14  3   20  29  28  18  21  19
18  21  4   15  20  7   20  24  6   27  3   10  27  14  15  7   4   22  7   17
``````

For the last S/R, I chose N = 38, because there are, only, 30 possible values and most numbers are, therefore, present, very often !

Hence, the last regex S/R is :

SEARCH `^(?!(\d{4})( \1){38}).+\R`

REPLACE (leave this field empty)

=> The most frequent numbers, in that random list of 1000 numbers, between 1 and 30, are the six integers, below :

7 ( present 45 times ), 8 and 13 ( present 40 times ), 14 and 26 ( present 39 times ) and 27 ( present 41 times ) !
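
N = 38 keeps the numbers present at least 39 times. One possible way to pick N without trial and error is to count first and take the k-th highest count minus one. The Python 3 sketch below shows the idea; the helper name `threshold_for_top` and the small `demo` list are made up for the illustration.

```python
from collections import Counter

def threshold_for_top(numbers, k):
    """Return the largest N that still keeps the k most frequent numbers."""
    counts = sorted(Counter(numbers).values(), reverse=True)
    # Lines with N occurrences or fewer are deleted, so N must stay
    # just below the count of the k-th most frequent number.
    return counts[k - 1] - 1

demo = [7] * 5 + [8] * 4 + [13] * 4 + [2]
n = threshold_for_top(demo, 3)
print(n)
```

With the counts reported above (45, 41, 40, 40, 39, 39) and k = 6, the same rule gives 39 - 1 = 38. If several numbers tie with the k-th one, they are all kept.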

Best Regards,

guy038

But, you are the best.

Thanks A LOT ! WORKS !

• BUT, the only problem is that it works on your examples, not on mine.

Can the `\R` in your regular expressions be replaced with something else?

• @guy038 said:

SEARCH ^(\d(\d(\d(\d)?)?)?)(?:\t|\R)
REPLACE (?2:0)(?3:0)(?4:0)\1\r\n

This regex of yours, `^(\d(\d(\d(\d)?)?)?)(?:\t|\R)`, doesn’t work for me. It is the first one and the most important. The other regex works fine.

But I found another way to do this. Suppose I have:

17 25 30 37 38 47
2 6 7 17 30 42
3 17 20 38 44 45
4 5 6 30 36 42

Search: a single space character
Replace by: `\r`

then

Search: `^(a*)` This matches the empty position at the beginning of each line
Replace by: 00

and I will get something like this:

0017
0025
0030
0037
0038
0047
002
006
007
0017
0030
0042
003
0017
0020
0038
0044
0045
004
005
006
0030
0036
0042
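
Note that in the output above the two-digit numbers end up with four characters (`0017`) while the one-digit numbers end up with three (`002`), so a later pattern built on `\d{4}` will not see every entry. A sketch of fixed-width padding in plain Python 3, assuming the numbers are separated by spaces or line breaks as in this example:

```python
rows = """17 25 30 37 38 47
2 6 7 17 30 42
3 17 20 38 44 45
4 5 6 30 36 42"""

# split() with no argument cuts on any run of whitespace,
# zfill(4) pads each number to exactly four digits
padded = [n.zfill(4) for n in rows.split()]

print("\r\n".join(padded))
```

Every entry now has the same width, for example `0002` instead of `002`.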

• @guy038 said:

SEARCH (\d{4})\R\1

REPLACE \1 \1 , with a space character, between the two back-references, \1

This, again, is not working for me: `(\d{4})\R\1`. And I pressed the “Replace All” button many times.

• @Vasile-Caraus

I know you are a regex fan, but just to give you an idea of what a Python script
that solves such a problem would look like

``````
from collections import Counter

# `editor` and `console` are objects provided by the Python Script plugin
x = editor.getText().split()                         # split on any whitespace: the list of numbers, no empty items
counted_list = Counter(x)                            # map each number (as text) to its number of occurrences
for item in counted_list.most_common(4):             # iterate over the top 4
    console.write('{}\n'.format(item))               # and print each (number, count) pair to the console
``````

I used the list of 1000 integers @guy038 posted.
The result in the console would be

('7', 45)
('27', 41)
('8', 40)
('13', 40)

Meaning that number 7 occurred 45 times
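
The same counting idea can be tried without Notepad++. The sketch below is plain Python 3 and does not use the plugin's `editor` or `console` objects; it runs on the 15 x 7 table from the first example, pasted in as a string.

```python
from collections import Counter

table = """
2   27  7   11  32  6   7
8   45  50  19  37  40  47
21  11  50  46  50  27  49
41  13  36  3   37  29  23
25  22  47  3   37  2   29
8   48  29  46  24  18  9
46  8   24  19  5   22  27
29  26  44  47  22  22  5
22  25  35  47  48  24  3
10  20  28  49  7   24  3
37  27  4   40  44  45  14
4   44  15  43  46  32  7
47  15  11  17  16  42  8
28  44  43  24  17  8   5
32  27  11  1   35  28  29
"""

# split() cuts on any whitespace; int() makes 8 and 08 the same number
counts = Counter(int(n) for n in table.split())
top = counts.most_common(6)

for number, count in top:
    print(number, count)
```

The six numbers it reports are the same six that the regex method found for this table, each present five times.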

Cheers
Claudia

• @Claudia-Frank said:

an idea how a python script

Hello Claudia, I don’t know Python, so I really don’t know what to do with the Python script you wrote above.

• posted

Hello Claudia,

I’ve just tested your Python solution, changing it to get the six most common numbers with the `counted_list.most_common(6)` expression, and it returns exactly the numbers that I had previously found for the 1000 random integers list :-)

How elegant a Python ( or Lua, I suppose ) script is, compared to my complicated regex’s cooking !!!

Cheers,

guy038

• Claudia and guy038, please tell me how to use this python script !

• a short tutorial for this example will be great !

• @Vasile-Caraus

What needs to be done first is described here.

Just in case that you haven’t installed python script plugin yet, I would propose to use the MSI package instead of using the plugin manager.

Short version: once the Python Script plugin has been installed, go to
Plugins->Python Script->New Script,
give it a name and press Save.
A new empty editor should appear.
Copy the content into it and save it.
Do NOT reformat the code, as Python is strict about whitespace.

Open the python script console by clicking on
Plugins->Python Script->Show Console

Open your file with the numbers and run the script by clicking on
Plugins->Python Script->Scripts->NAME_OF_YOUR_SCRIPT
Cheers
Claudia

• WORKS GREAT Claudia.

Thanks a lot !

• By the way, Claudia, how can I use Python (like your script) to actually modify the .txt file? For now, the script only shows its results in the console. How can I use a Python script to search and replace something in .txt files?

• @Vasile-Caraus

If you want to dive into Python, the first thing, of course, is to get some basic knowledge of the language itself.
Note that the plugin uses Python 2, NOT 3 (there are differences, nothing too critical, but they can be confusing
if you start learning the language and try something which works in py3 but not in py2).

Next the help pages which come with the plugin itself.
Plugins->Python Script->Context-Help

And last but not least Scintillas help at http://www.scintilla.org/ScintillaDoc.html to get a better
understanding how the editor works.

The console is a good starting point to test things first.
In order to get all functions, attributes of a py object you can use the dir command.
So, if you do the following in the console you will get the list of functions of this object

``````
dir(editor)
``````

I prefer not to have to scroll sideways, so I use

``````
print '\n'.join(dir(editor))
``````

In order to see what the parameters of a function are use the help command like

``````
help(editor.insertText)
``````

Next, if you search the forum, you will find many scripts that solve particular issues;
one of my first posts answered a question about unit conversion.

And finally, ask the question here if you have a specific question.

Cheers
Claudia

Ahh… I would suggest making the following change in Notepad++:
Settings->Preferences->Language, check “replace by space”, because
Python doesn’t like it if you mix tabs and spaces for indentation.

• @Claudia-Frank

Regarding print '\n'.join(dir(editor))

I don’t think that ‘print’ outputs to the Pythonscript console window by default.

From the following in the original startup.py:

# This sets the stdout to be the currently active document, so print "hello world",
# will insert "hello world" at the current cursor position of the current document
sys.stdout = editor

This is of dubious value, especially since a ‘print’ used in this way inserts the text specified plus a UNIX-style line ending into your current file (which likely has Windows-style line endings!).

I, and likely also Claudia, have changed this line in startup.py to be:

sys.stdout = console

thus changing ‘print’ statements to output their data to the Pythonscript console (great for debugging your scripts!)
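
The mechanism behind this can be seen outside the plugin: `print` simply writes to whatever object `sys.stdout` currently points to. The sketch below is plain Python 3 (the plugin itself uses Python 2), and `fake_console` is only a stand-in for the plugin's `console` object.

```python
import io
import sys

fake_console = io.StringIO()      # hypothetical stand-in for the plugin's console object
saved_stdout = sys.stdout
sys.stdout = fake_console         # same idea as `sys.stdout = console` in startup.py
try:
    print("hello world")          # goes to fake_console, not to the real stdout
finally:
    sys.stdout = saved_stdout     # always restore the real stdout

captured = fake_console.getvalue()
```

`captured` ends with a bare `\n`, which matches the UNIX-style line ending mentioned above.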

As alluded to above, the Pythonscript console seems to use UNIX-style line endings. I found this out in an odd way. If you copy-and-paste from the console to an editing window with Windows line endings, the line-endings on the source text will be changed at the time of the paste to match the destination file format, so all is good. HOWEVER, what I did one time was to paste via the “Clipboard History” window. This action seems to preserve the original UNIX-style line endings at the destination! I was quite confused as to why I had inconsistent line-endings in my document, until I figured it out.

• @Scott-Sumner

Scott, you are absolutely correct, I’ve changed this in startup.py
and for me this is much more convenient than using console.write to
print chars to the console.
Just a side note: the command
print '\n'.join(dir(editor))
should have been executed in the console itself, and there it works,
but if someone used it in a script, then it would print to the editor unless
you make the change Scott mentioned.

Thx for the info about copy/paste - I do this a lot but luckily I didn’t use the history ;-)

Cheers
Claudia
