How to extract string from file, run filter, and replace in file with new value?

TASK

I am coding up ebooks to a specific standard, and have a script that converts a string into the correct titlecase for this publisher. When working with some public domain source files, one often gets this for a chapter title string:

HERE IS MY TITLE

Using VSCodium (FOSS VS Code alternative), I can open each file, select the string between the p tags, then run the titlecase script with a hotkey that I've assigned it to. I end up with

Here Is My Title

(VSCodium's native titlecase filter isn't up to this job.) I save the file, and go on to the next one.

If you only have a few of these to do, that's fine. But sometimes there can be dozens, and it gets very tedious.

QUESTION

Is there a way that I can script this? I have scratched my head over both awk and sed, thinking that these are my prime options. But (as a rank amateur) I cannot work out how to:

1. iterate through all chapter-*.xhtml files in a directory,
2. extract my string (ALWAYS line 12 in the file, only string on line, between <p>...</p> tags),
3. run my "external" titlecase filter on that string,
4. substitute the new string for the original one in the source file,

for all those files. :)

(The step in bold is the one that is my biggest stumbling block.)

UPDATE: Note that for my titlecase filter, ONLY the string between the tags can be used, so that step #2 (extracting the string) is mandatory. Both the answers so far look very promising, but is it possible to do something like e.g. a regex on sed -n '12p' in one answer?

The other answer suggests using pup although it would be helpful not to need extra packages if a simple regex would do.

UPDATE 2: for "real" data, one could download the ZIP of this commit in a Github repo - the files in question are found at: /src/epub/text/chapter-*.xhtml = the 12th line of every "chapter-nn.xhtml" file.

scripting sed text-processing

posted over 2 years ago

CC BY-SA 4.0

2y ago

David‭

(Assuming your file names are portable, according to POSIX (https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_282). If not, please read this for writing a more robust script: https://linux.codidact.com/posts/288310/289999#answer-289999.)

find . -type f \
| grep '/chapter-[^/]*\.xhtml$' \
| while IFS= read -r f; do
        (
                head -n11 <"$f";
                sed -n '12p' <"$f" | titlecase;
                tail -n+13 <"$f";
        ) \
        | sponge "$f";
done;

find | grep is for getting the file names.
while read loops over the file names; for each one, the inner commands run in a (...) subshell.
() | sponge "$f" will put everything printed by the (...) into the original file, atomically after all other commands have finished. sponge(1) is provided by the moreutils package.
head -n11 prints the first 11 lines pristine.
sed -n 12p prints only the 12th line, which is then piped into titlecase.
titlecase: Your script, assuming it reads stdin, and writes to stdout.
tail -n+13 prints the remaining lines, starting at 13, pristine.
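
If installing moreutils just for sponge(1) is not an option, the same head/filter/tail idea can be sketched with a temporary file and mv. This is only a minimal demo under assumptions: `titlecase` below is a stand-in shell function that merely lowercases its input (the real filter is not shown in this thread), and the sample file is generated on the fly.

```shell
#!/bin/sh
# Stand-in for the real titlecase filter (an assumption for this demo):
# reads stdin, writes stdout, and simply lowercases.
titlecase() { tr 'A-Z' 'a-z'; }

workdir=$(mktemp -d)

# Build a 13-line sample file whose 12th line holds the title.
{
    i=1
    while [ "$i" -le 11 ]; do echo "<xx>"; i=$((i + 1)); done
    echo '  <p>HERE IS MY TITLE</p>'
    echo '</html>'
} >"$workdir/chapter-1.xhtml"

for f in "$workdir"/chapter-*.xhtml; do
    tmp=$(mktemp "$f.XXXXXX")
    {
        head -n 11 <"$f"                 # lines 1-11 untouched
        sed -n '12p' <"$f" | titlecase   # line 12 through the filter
        tail -n +13 <"$f"                # lines 13 onward untouched
    } >"$tmp" && mv "$tmp" "$f"          # replace the original only on success
done

line12=$(sed -n '12p' "$workdir/chapter-1.xhtml")
line_count=$(wc -l <"$workdir/chapter-1.xhtml")
echo "$line12"
rm -rf "$workdir"
```

One caveat with this variant: the file created by mktemp has restrictive permissions, and mv carries those over to the chapter file, so a chmod afterwards may be needed.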

Disclaimer: untested. If you provide some samples, I'll test it.

If for some reason, you'd want to read a file only once, you could try writing a more complex filter using perl(1) (or maybe you manage to write it in sed(1)). That would remove the need for head(1) and tail(1).

$ printf 'ASD\nFOO BAR BAZ\nQWE\nRTY\n' \
| perl -p -e 's/(?<=[[:alpha:]])([[:upper:]])/\L$1/g if 2 .. 3';
ASD
Foo Bar Baz
Qwe
RTY

posted over 2 years ago

CC BY-SA 4.0

2y ago

alx‭

Marked as "Works for me" by alx‭ (Dec 6, 2023 at 00:34): "Nice sed(1) regex! It looks obvious after seeing it."

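for file in chapter-*.xhtml
do
    sed -i -r "12s/\b([A-Z])([A-Z]+)/\1\L\2/g;" "$file"
done

The -i and -r options tell GNU sed to alter the file in place (-i) and to use extended regular expressions (-r). Keep them separate: GNU sed parses a combined -ir as -i with the backup suffix "r", which leaves extended regular expressions switched off.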

On line 12, starting at each word boundary, this matches one uppercase letter (group 1) followed by one or more uppercase letters (group 2). Group 1 is left untouched, group 2 is converted to lowercase (\L), and the g flag repeats this for every word on the line.

Note that this will turn USB to Usb, USA to Usa, UNO to Uno and so on.
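
To see the effect without touching any file, the same substitution can be run on a sample string. This is a small sketch that assumes GNU sed (\b and \L are GNU extensions); the sample text is made up for illustration.

```shell
#!/bin/sh
# Assumes GNU sed: \b (word boundary) and \L (lowercase) are GNU extensions.
sample='THE USB PORT OF A LAPTOP'
result=$(printf '%s\n' "$sample" | sed -E 's/\b([A-Z])([A-Z]+)/\1\L\2/g')
echo "$result"
```

This prints `The Usb Port Of A Laptop`: the acronym is flattened, the single-letter word is left alone because the pattern needs at least two letters, and small words such as "Of" are capitalized too, which a publisher-specific titlecase filter may treat differently.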

For reaching into subdirectories,

find -name "chapter-*.xhtml" -exec sed -i -r "12s/\b([A-Z])([A-Z]+)/\1\L\2/g;" {} ";"

Again, GNU-tools (find, sed) are assumed (as default on most Linux systems).

posted about 2 years ago

CC BY-SA 4.0

2y ago by Mithical‭

user-unknown‭

iterate through all chapter-*.xhtml files in a directory

Assuming bash, and assuming that at least one such file exists in the current directory (otherwise adjust the path and/or shopt -s nullglob), you can use a simple for loop to do this.

for filename in chapter-*.xhtml; do
    ...
done
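
If nullglob is not available (for example in plain sh), a guard inside the loop is one way to cope with a directory that has no matching files. A minimal sketch follows; the function name `count_chapters` and the temporary directory are illustrative assumptions, not part of the original setup.

```shell
#!/bin/sh
# Count matching chapter files in a directory, tolerating "no match".
count_chapters() {
    n=0
    for filename in "$1"/chapter-*.xhtml; do
        # An unmatched glob is passed through literally, so skip
        # anything that does not actually exist.
        [ -e "$filename" ] || continue
        n=$((n + 1))
    done
    echo "$n"
}

workdir=$(mktemp -d)
empty_count=$(count_chapters "$workdir")   # directory is still empty
: >"$workdir/chapter-01.xhtml"             # create one empty chapter file
one_count=$(count_chapters "$workdir")
echo "$empty_count $one_count"
rm -rf "$workdir"
```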

extract my string (ALWAYS line 12 in the file, only string on line, between ... tags)

Since you know that this will always be line 12, the easiest-to-read way to do this is probably awk, in which it becomes:

line12=$(awk 'NR==12{print;exit;}' "$filename")

Do note that the resulting $line12 will include whitespace and tags in addition to the textual content of the tag. If this is a problem, you can use pup to extract only the text from within the <p> tag:

line12=$(awk 'NR==12{print;exit;}' "$filename" | pup 'p text{}')

in which case of course you will need to adjust the replacement step accordingly.

(pup is a tool to parse HTML and extract portions of it based on CSS selectors.)
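
If an extra package is undesirable, a plain sed capture group may be enough here, given that line 12 holds a single `<p>...</p>` pair and nothing else. The following is a sketch under that assumption, run against a generated sample file; it is not a general HTML parser.

```shell
#!/bin/sh
# Generate a sample file with the title on line 12 (demo data only).
sample=$(mktemp)
awk 'BEGIN {
    for (i = 1; i <= 11; i++) print "<xx>"
    print "    <p>HERE IS MY TITLE</p>"
    print "</html>"
}' >"$sample"

# On line 12 only: keep what sits between <p> and </p>, print just that.
title=$(sed -n '12s#.*<p>\(.*\)</p>.*#\1#p' "$sample")
echo "$title"
rm -f "$sample"
```

Because the pattern is anchored on the tags rather than on a column position, it does not depend on how far the line is indented.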

run my "external" titlecase filter on that string

Assuming that titlecase is executable, accepts the old title on standard input, and emits the new title on standard output, you can pipe the output from awk above into titlecase, as in:

newline12=$(awk 'NR==12{print;exit;}' "$filename" | titlecase)

replace the new string for the original one in the source file

There are many ways to do this, but assuming that the replacement doesn't contain special characters, you can do something similar to:

sed -i '12s#^.*$#'"$newline12"'#' "$filename"

This will replace the entirety of line 12 in the file with the contents of the $newline12 shell variable. Adjust the 12 if you need to replace a differently numbered line. I use # as the delimiter here because the traditional / would conflict with the slash in the </p> end tag.

-i is in-place editing mode; if you omit it, sed will print the result on standard output, which you can redirect to another file:

sed '12s#^.*$#'"$newline12"'#' "$filename" >"$filename".new
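
If the new title might contain characters that are special in a sed replacement (such as & or the # delimiter), one option is to sidestep sed's replacement syntax and let awk print the new line verbatim from the environment. This is a sketch with made-up sample data; the variable name `NEWLINE12` is an assumption chosen for the demo.

```shell
#!/bin/sh
# Generate a sample file with a placeholder title on line 12 (demo data only).
sample=$(mktemp)
awk 'BEGIN {
    for (i = 1; i <= 11; i++) print "<xx>"
    print "  <p>OLD</p>"
    print "</html>"
}' >"$sample"

# A replacement containing characters that sed would treat specially.
newline12='  <p>Tom &amp; Jerry #1</p>'

# Pass the text through the environment; ENVIRON values are used as-is,
# with no escape or & processing.
tmp=$(mktemp)
NEWLINE12="$newline12" awk '
    NR == 12 { print ENVIRON["NEWLINE12"]; next }   # swap in the new line
    { print }                                       # copy all other lines
' "$sample" >"$tmp" && mv "$tmp" "$sample"

result=$(sed -n '12p' "$sample")
total=$(wc -l <"$sample")
echo "$result"
rm -f "$sample"
```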

Putting it all together:

for filename in chapter-*.xhtml; do
    newline12=$(awk 'NR==12{print;exit;}' "$filename" | titlecase)
    sed -i '12s#^.*$#'"$newline12"'#' "$filename"
done

Example:

Input `chapter-1.xhtml`

<html>
<xx>
<xx>
<xx>
<xx>
<xx>
<xx>
<xx>
<xx>
<xx>
<xx>
  <p>HERE IS MY TITLE</p>

</html>

Execution

I use an alias in place of your likely actual titlecase here, but the principle is exactly the same:

$ alias titlecase='tr A-Z a-z'
$ for filename in chapter-*.xhtml; do
    newline12=$(awk 'NR==12{print;exit;}' "$filename" | titlecase)
    sed -i '12s#^.*$#'"$newline12"'#' "$filename"
  done

Output `chapter-1.xhtml`

<html>
<xx>
<xx>
<xx>
<xx>
<xx>
<xx>
<xx>
<xx>
<xx>
<xx>
  <p>here is my title</p>

</html>

posted over 2 years ago

CC BY-SA 4.0

Canina‭

Thanks @Canina‭ - helpful! I see from your answer (and even more the other one) that I need to edit my ... (4 comments)

Both replies, at different points, provided the basis for this working script. Assuming that the 12th line of file has something like:

    <p>HERE IS MY TITLE</p>

where HERE... begins at column 8 (I need to omit the opening <p> tag, as noted in the original post), then:

for filename in chapter-*.xhtml; do
    new12=$(sed -n '12p' "$filename" | cut -b 8- | se titlecase -n)
    sed -i -e '12s#^\(.*<p>\).*#'"\1$new12"'#g' "$filename"
  done

The two middle lines work this way:

Line 2:

sed -n '12p' "$filename" = print the 12th line of the file
cut -b 8- = "cut" from the 8th byte onward, so in this example the pipe receives HERE IS MY TITLE</p> (the closing tag travels along with the title)
se titlecase -n = run the titlecase script (-n prevents it from generating a "newline")
all that assigned to $new12.

Line 3

sed -i -e '12s#^\(.*<p>\).*#'"\1$new12"'#g' "$filename" = replace the original line 12: the first part of the line, up to and including the opening <p>, is captured in a group and restored through the backreference \1 in the replacement, followed by the $new12 value.

Produces:

    <p>Here Is My Title</p>

in "$filename". Done. :)

posted about 2 years ago

CC BY-SA 4.0

2y ago

David‭

Command substitution `$(...)` already strips the trailing newline. (1 comment)
