Find matching words between two text files

  • F
    Frederick Smith
    last edited by Sep 23, 2018, 3:13 AM

    Hi guys, I need some help here.
    Let’s say I have 2 text files, each with hundreds of lines:
    text file A and text file B.
    I want to match all words in A against B, so that any word from A that is found in B gets highlighted.
    Example: let’s say file A has the word CAR, and file B has the word CARPOOL.
    It would match the word CAR and highlight it (so only CAR would be highlighted, not the whole word CARPOOL).
    Or all matching words could be saved to a new file. Either would be great.
    So file A is the “source”: match any/all words from file A in file B (if they exist).
    I tried Compare, but it shows the differences. I need the matches.
    Thanks in advance for your help.
    Frederick

    • T
      Terry R
      last edited by Terry R Sep 23, 2018, 8:07 AM Sep 23, 2018, 8:05 AM

      Hi @Frederick Smith
      your problem was interesting. It was very similar in nature to a solution presented by @guy038, namely:
      https://notepad-plus-plus.org/community/topic/16335/multiline-replace-multiple-hosts-in-hostsfile
      In that instance the question was how to remove lines when duplicates are found. In essence, though, the search method here is very close to that one.

      I’m going to assume that file A contains one word per line; if not, we need to get file A into that format (when you copy lines you want ONLY the duplicated word, not additional words on the same line). Open file A first, then add a line containing “---” (three dashes) at the bottom as its last line, and paste the contents of file B below it.

      Open the Mark function and use the following:
      Find What: (?is)^(.+)\R(?=.*---.*\1)
      You need search mode set to regular expression (very important) and wrap around ticked. Also tick Bookmark Lines, this will help later.

      Place the cursor at the very top left of the file, that is, at the start of the file A contents; otherwise the result will be unpredictable. You only need to click the Mark All button once. Any file A entries that also appear in the file B area (below the --- line) will be marked, and their lines will be bookmarked (blue circle in the margin). The --- line prevents lines in the file B area from being matched against each other.

      Now use the “Search” menu option, select “Bookmark”, then “Copy Bookmarked Lines”. Put the copied lines elsewhere, which is what you requested.

      My regex includes the (?is) modifiers: s means the dot also matches line-break characters (such as CRLF, carriage return plus line feed), so they are treated like all other characters; i makes the search case-insensitive, so “CAR” would also find “car”, “Car”, “cAr”, etc.
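
      For anyone who would rather script this than use the Mark dialog, the same idea can be sketched in Python. This is only an illustration and not part of Notepad++: the function name `words_of_a_found_in_b` and the sample word lists are made up for the example, and it assumes file A holds one word per line, as above.

```python
def words_of_a_found_in_b(a_lines, b_lines):
    """Return the entries of A that occur, ignoring case, anywhere inside a line of B."""
    b_lower = [line.strip().lower() for line in b_lines]
    found = []
    for word in a_lines:
        word = word.strip()
        if not word:
            continue  # skip blank lines in file A
        # Substring test, so "car" counts as found inside "carpool"
        if any(word.lower() in line for line in b_lower):
            found.append(word)
    return found


# Hypothetical sample data, one word per line in each "file"
file_a = ["car", "apple", "beach", "hello", "down", "sun", "question"]
file_b = ["city", "whatever", "carpool", "san", "beachcity", "cornel", "downpillow"]

matches = words_of_a_found_in_b(file_a, file_b)
print(matches)
```

      With real files, the two lists could be read with `open(path).read().splitlines()`, and the result written to a new file to get the list of matching words.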

      I hope this helps, otherwise come back with more info including samples of actual file A and B contents if you can.

      Terry

      • F
        Frederick Smith
        last edited by Sep 25, 2018, 4:06 PM

        Hi Terry,
        Thanks a lot for taking the time and responding to my question.
        First - you’re correct in your assumption.
        ALMOST THERE…
        At first I tried it and it didn’t work. Then, looking at your regex, I realized it calls for “---” (3 dashes), not “-”, so once I changed that it WORKED!
        With one exception!
        The only thing is that it marks the file A part, not the file B part
        (and I would need the file B part to be marked).

        • I tried flipping the files around, but that didn’t work.

        These are not real files… just a sample to illustrate…

        This is file A:
        car
        apple
        beach
        hello
        down
        sun
        question

        This is file B:
        city
        whatever
        carpool
        san
        beachcity
        cornel
        downpillow

        I opened file A and changed it to this:
        car
        apple
        beach
        hello
        down
        sun
        question
        ---
        city
        whatever
        carpool
        san
        beachcity
        cornel
        downpillow

        So, instead of marking: car, beach, down
        I would need it to mark: carpool, beachcity, downpillow
        So “car” would be highlighted in “carpool”.

        So how do I change the “Mark” function to get that result?

        Thanks again Terry!

        • G
          guy038
          last edited by Sep 25, 2018, 7:38 PM

          Hello, @frederick-Smith; @terry-r and All,

          Of course, with your additional information, it becomes easier to point out the suitable regex ! I hope that Terry won’t mind if I reply to you, first ;-))


          Actually, you have two files : File_A, which contains a list of strings that are, possibly, substrings of some words contained in the File_B list !

          Then, we’re going to reverse the logic :

          • First, in a new N++ tab, copy/paste the File_B.txt contents

          • Add the single line ---

          • Then, under this line, insert the File_A contents

          • Open the Mark dialog

          • Use the regex search :

          (?si)(.+)(?=.*^---\R.*^\1$)

          • Preferably, tick the Purge for each search option

          • Click on the Mark All button


          So, given File_B contents, below :

          city
          whatever
          carpool
          san
          beachcity
          cornel
          downpillow
          

          and File_A contents, below :

          car
          cornel
          apple
          beach
          hello
          ever
          down
          sun
          it
          question
          

          Just note that I added 3 words, ever, cornel and it, in order to show that “subset-words” can also be marked in the middle or at the end of a whole word, or that an entire word can be highlighted !

          Now, we add, in a new tab, the following text :

          city
          whatever
          carpool
          san
          beachcity
          cornel
          downpillow
          ---
          car
          cornel
          apple
          beach
          hello
          ever
          down
          sun
          it
          question
          

          Finally, using the Mark dialog and the regex (?si)(.+)(?=.*^---\R.*^\1$), it should highlight the parts shown in square brackets, below :-))

          c[it]y
          what[ever]
          [car]pool
          san
          [beach]c[it]y
          [cornel]
          [down]pillow

          Notes :

          • As usual, the (?si) modifiers mean a case-insensitive search and that any dot ( . ) will match any single character ( standard and EOL )

          • Then, the main part (.+) tries to match the longest, non-empty, range of characters, even across several lines, stored as group 1, but ONLY IF the positive look-ahead (?=.*^---\R.*^\1$) is TRUE. That is to say, IF it detects :

            • A range of any character, possibly empty, .* ,

            • followed with a line with, only, 3 dashes and its line-break, ^---\R ,

            • followed, again, with the longest range, possibly null, of any character, .* ,

            • and ended with the contents of group 1, alone on its line, ^\1$


          Remark : if you prefer a case-sensitive search, simply begin the regex with (?s-i), instead !
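
          The same look-ahead idea can also be tried outside Notepad++. Below is a rough Python sketch, not the exact Notepad++ regex: Python’s `re` module has no `\R`, so `\n` is used instead (assuming the text uses plain `\n` line endings), and the `m` flag is added so that `^` and `$` match at line boundaries. The variable names are only for illustration.

```python
import re

# Python adaptation of the Notepad++ regex (?si)(.+)(?=.*^---\R.*^\1$)
#   s : the dot also matches line breaks
#   i : case-insensitive search
#   m : ^ and $ match at every line start / end (an addition needed in Python)
PATTERN = re.compile(r"(?sim)(.+)(?=.*^---\n.*^\1$)")

# File_B contents, then the --- separator line, then File_A contents
text = "\n".join([
    "city", "whatever", "carpool", "san", "beachcity", "cornel", "downpillow",
    "---",
    "car", "cornel", "apple", "beach", "hello", "ever", "down", "sun", "it", "question",
])

# Each match is a part of a File_B word that also exists as a whole line in File_A
highlighted = [m.group(1) for m in PATTERN.finditer(text)]
print(highlighted)
```

          On this sample, the matches are it, ever, car, beach, it, cornel and down, all located in the File_B part above the --- line. Nothing below the separator can match, because the look-ahead needs a --- line after the current position.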

          Cheers,

          guy038

          • T
            Terry R
            last edited by Sep 25, 2018, 8:26 PM

            @Frederick-Smith said:

            It would match the word: CAR - and highlight it. (so only CAR would be highlighted - not: word: CARPOOL.

            I interpreted that as meaning the word in file A should be highlighted, whereas what you really meant was that the letters CAR in carpool should be highlighted because CAR also exists in file A. Sorry about that, and about the confusion over the 3 “-”; sometimes characters don’t show well, and it’s the interpreter (behind the compose window) that causes most of the issues. As @guy038 has given you another solution that fits your requirements, I’ll leave it at that.

            Be sure to come back if you need anyone to elaborate or help further.

            Terry

            • F
              Frederick Smith
              last edited by Sep 26, 2018, 3:30 PM

              Hi @terry-r, @guy038 and All

              First I want to thank you both: @terry-r and @guy038 - for taking your time and giving me help.

              Both solutions work - maybe a bit differently - but both give the good results I was looking for.

              Let me say how much I appreciate the community. Thank you!

              Thanks again guys!

              The Community of users of the Notepad++ text editor.