
RegEx mystery (Real Studio network user group Mailinglist archive)




RegEx mystery
Date: 25.09.11 11:03 (Sun, 25 Sep 2011 12:03:43 +0200)
From: Franck Perez
Dear List,

I am always kind of lost with RegEx, but here more than ever.
I have text in a TextField that contains lines that look like this:

Seq_1 1 GTTAGGCGTTTTGCGCTGCTTCGCGATGTACGGGCCAGATATACGCGTTGACATTGATTA 60

I need to extract the starting number (here 1) and the sequence itself
(without surrounding spaces).
I use a RegEx search as follows:

RegExpObj.options.greedy = false // disable greediness
RegExpObj.options.DotMatchAll = true
RegExpObj.searchPattern = "Seq_1\s*(\d*)\s*([A-Za-z]*)(.*)$" // \s*$"
//find a Seq1 line
RegExpMatchObj = RegExpObj.Search (AlignedField.text, SearchStart)
if RegExpMatchObj<>nil then
Dim testStr0, testStr1, TestStr2, TestStr3 as String

SearchStart= RegExpMatchObj.SubExpressionStartB(0)+1
ValNtBegining1.Append val(RegExpMatchObj.SubExpressionString(1)) // to find the char where the line starts
testStr0 = RegExpMatchObj.SubExpressionString(0)
testStr1 = RegExpMatchObj.SubExpressionString(1)
testStr2 = RegExpMatchObj.SubExpressionString(2)
testStr3 = RegExpMatchObj.SubExpressionString(3)
end if

Results:
TestStr0 = Seq_1 1
atggtgagcaagggcgaggagctgttcaccggggtggtgcccatcctggtcgagctggac 60
TestStr1 = empty
TestStr2 = empty
TestStr3 = 1
atggtgagcaagggcgaggagctgttcaccggggtggtgcccatcctggtcgagctggac 60

I am completely confused here.
How can the RegEx find its target (the match is not nil) and at the same time
return empty strings for RegExpMatchObj.SubExpressionString(1)
and RegExpMatchObj.SubExpressionString(2), while
RegExpMatchObj.SubExpressionString(3) still contains the spaces
before "1" as well as the "1" itself?

Thanks for your help,
best,
Franck.
_______________________________________________
Unsubscribe or switch delivery mode:
<http://www.realsoftware.com/support/listmanager/>

Search the archives:
<http://support.realsoftware.com/listarchives/lists.html>

Re: RegEx mystery
Date: 25.09.11 17:51 (Sun, 25 Sep 2011 18:51:33 +0200)
From: Franck Perez
Many thanks. Clearly, I was not handling the greedy parameter correctly.
It now works very well.
Just to finish up the discussion: the only problem left is that I should have
told you that the text has several lines to be parsed, like this:

Seq_1 1 GACGAAAGGGCCTCGTGATACGCCTATTTTTATAGGTTAATGTCATGATAATAATGGTTT 60
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Seq_2 1 GACGAAAGGGCCTCGTGATACGCCTATTTTTATAGGTTAATGTCATGATAATAATGGTTT 60

Seq_1 61 CTTAGACGTCAGGTGGCACTTTTCGGGGAAATGTGCGCGGAACCCCTATTTGTTTATTTT 120
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Seq_2 61 CTTAGACGTCAGGTGGCACTTTTCGGGGAAATGTGCGCGGAACCCCTATTTGTTTATTTT 120

Seq_1 121 TCTAAATACATTCAAATATGTATCCGCTCATGAGACAATAACCCTGATAAATGCTTCAAT 180
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Seq_2 121 TCTAAATACATTCAAATATGTATCCGCTCATGAGACAATAACCCTGATAAATGCTTCAAT 180

Seq_1 181 AATATTGAAAAAGGAAGAGTATGAGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTT 240
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Seq_2 181 AATATTGAAAAAGGAAGAGTATGAGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTT 240

When I use both of your methods, I correctly get $1 and $2, but $3 contains
all the rest of the text and not only "60". I first suspected the greedy
parameter, but the real cause is that I had set
RegExpObj.options.DotMatchAll = true.
With RegExpObj.options.DotMatchAll = false, everything works as expected.

Thanks again,
best,
Franck
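
For anyone who wants to reproduce the DotMatchAll effect described above, here is a minimal sketch. It is written in Python rather than Xojo so that it can be run anywhere, which is an assumption on my part: Python's re module only stands in for the Xojo RegEx class, with re.DOTALL playing the role of DotMatchAll and re.MULTILINE making "$" match at the end of each line. The shortened sequences are hypothetical sample data, and option names and defaults differ between engines.

```python
import re

# Hypothetical, shortened sample in the same layout as the alignment above.
text = (
    "Seq_1 1 GACGAAAGGG 60\n"
    "Seq_2 1 GACGAAAGGG 60\n"
    "Seq_1 61 CTTAGACGTC 120\n"
)
pattern = r"Seq_1\s+(\d+)\s+([A-Za-z]+)(.+)$"

# "." stops at the newline, so group 3 is only the rest of the current line.
per_line = re.search(pattern, text, re.MULTILINE)

# With DOTALL, "." also matches newlines, so the greedy (.+) runs to the
# end of the whole text before "$" is tested.
dot_all = re.search(pattern, text, re.MULTILINE | re.DOTALL)

# Collect (start number, sequence) for every Seq_1 line.
starts = [
    (int(m.group(1)), m.group(2))
    for m in re.finditer(pattern, text, re.MULTILINE)
]

print(repr(per_line.group(3)))
print(repr(dot_all.group(3)))
print(starts)
```

With "." confined to a single line, the same pattern can simply be applied repeatedly to walk through every Seq_1 line of the alignment.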






Re: RegEx mystery
Date: 25.09.11 17:00 (Sun, 25 Sep 2011 12:00:10 -0400)
From: Kem Tekinay
Replace the "*" (zero or more) with "+" (1 or more), then turn "greedy" on and see what happens. When I test that here in RegExRX, I get the following matches:

$1 = "1"
$2 = "GTTAGGCGTTTTGCGCTGCTTCGCGATGTACGGGCCAGATATACGCGTTGACATTGATTA"
$3 = "60"

I think that's what you're looking for.

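The logic behind "greedy" is essentially this: when it's off, each quantifier starts with as few characters as possible and works forwards, adding more only as needed; when it's on, each quantifier grabs as many characters as possible and works backwards, giving some up only as needed. Either way, you are presented with the first match the engine finds.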
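
A small sketch of that difference, in Python rather than Xojo (an assumption made so it can be run anywhere). Python has no global greedy switch, so the "off" case is emulated by putting "?" after every quantifier. The shortened sequence is hypothetical sample data.

```python
import re

# Hypothetical, shortened sample line.
line = "Seq_1 1 GTTAGGCGTTTTGCGC 60"

# Greedy: each quantifier takes as much as it can, then backs off if needed.
greedy = re.search(r"Seq_1\s+(\d+)\s+([A-Za-z]+)(.+)$", line)

# Lazy: each quantifier takes as little as it can, then extends if needed.
lazy = re.search(r"Seq_1\s+?(\d+?)\s+?([A-Za-z]+?)(.+?)$", line)

greedy_groups = greedy.groups()
lazy_groups = lazy.groups()

print(greedy_groups)
print(lazy_groups)
```

In the lazy case the sequence group settles for a single letter and the final group has to pick up everything that remains, because it is the only part of the pattern that must reach "$".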

Here is the code as created by RegExRX:

dim rx as new RegEx
rx.SearchPattern = "Seq_1\s+(\d+)\s+([A-Za-z]+)(.+)$"

dim match as RegExMatch = rx.Search( "SourceText" )

If you know that the last part of the string will start with a digit, this would work too, regardless of greedy:

Seq_1\s+(\d+)\s+([^\d\s]+)(\d.+)$

But honestly, it's easier just to keep greedy on.
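
The digit-anchored idea can be checked with a short sketch, again in Python as a stand-in for Xojo (an assumption). The pattern as posted has nothing to consume whitespace between the sequence and the trailing number, so on a sample line where a space separates them it may not match. The variant below adds an optional \s* there; that addition is mine and not part of the original suggestion. The sample line is hypothetical.

```python
import re

# Hypothetical, shortened sample line with a space before the trailing number.
line = "Seq_1 1 GTTAGGCGTTTTGCGC 60"

# Pattern exactly as posted: no element matches the space before "60".
as_posted = re.search(r"Seq_1\s+(\d+)\s+([^\d\s]+)(\d.+)$", line)

# Assumed variant with \s* added, in greedy form...
greedy = re.search(r"Seq_1\s+(\d+)\s+([^\d\s]+)\s*(\d.+)$", line)

# ...and with every quantifier made lazy (stand-in for greedy = false).
lazy = re.search(r"Seq_1\s+?(\d+?)\s+?([^\d\s]+?)\s*?(\d.+?)$", line)

print(as_posted)
print(greedy.groups())
print(lazy.groups())
```

Because the sequence group cannot contain digits or whitespace and the last group must begin with a digit, the group boundaries are pinned down by the character classes rather than by greediness, so both forms give the same captures.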

Finally, try out RegExRX, available at my web site, to help with your regular expression development and testing. You'll find it's a lot easier than having to write test variables in your code.


Re: RegEx mystery
Date: 25.09.11 12:40 (Sun, 25 Sep 2011 13:40:38 +0200)
From: Stéphane Mons
This works:

RegExpObj.options.greedy = true // <<<<<<<<<<<< ENABLE greediness
RegExpObj.options.DotMatchAll = true
RegExpObj.searchPattern = "Seq_1\s*([0-9]+)\s*([ACGT]+)\s*(.*)$"

For the pattern, I personally never use \d but [0-9] instead. Also, I changed "*" to "+" because "*" allows for empty strings. The problem here was partly that you disabled greediness while using "*", so every character actually matched the last "(.*)" (i.e. any character) while each expression before that was "not greedy enough" to catch some characters. With the above example, I get:

0: Seq_1 1 GTTAGGCGTTTTGCGCTGCTTCGCGATGTACGGGCCAGATATACGCGTTGACATTGATTA 60
1: 1
2: GTTAGGCGTTTTGCGCTGCTTCGCGATGTACGGGCCAGATATACGCGTTGACATTGATTA
3: 60
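
The explanation above can be reproduced with a minimal sketch in Python (used here as a stand-in for Xojo, which is an assumption). The original pattern is given lazy quantifiers to emulate greedy = false, and re.DOTALL stands in for DotMatchAll. The shortened sequence is hypothetical sample data.

```python
import re

# Hypothetical, shortened sample line.
line = "Seq_1 1 GTTAGGCGTTTTGCGC 60"

# Original pattern, every "*" made lazy: each group is happy with zero
# characters, so only the last group grows until "$" can match.
lazy_star = re.search(
    r"Seq_1\s*?(\d*?)\s*?([A-Za-z]*?)(.*?)$", line, re.DOTALL
)

# Corrected pattern from this message, greedy, with "+" on the groups
# that must not be empty.
fixed = re.search(
    r"Seq_1\s*([0-9]+)\s*([ACGT]+)\s*(.*)$", line, re.DOTALL
)

print(lazy_star.groups())
print(fixed.groups())
```

The first result matches the symptom in the original question: groups 1 and 2 are empty and group 3 holds everything after "Seq_1", including the leading space and the "1".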

HTH


5 REM My Signature
10 PRINT "Stéphane"
20 GOTO 10


