Write a Linux command line to traverse a directory of files and summarize the number of instances of each file extension found (e.g., .docx, .jpg, .pdf, etc.). Fold all file extensions to lowercase: ".JPG" and ".jpg" should be reported together. Files that have no extension should be reported as "other". List the extensions found in descending numeric order by the number of occurrences of each extension.
#Linux #DFIR #CommandLine #Trivia
lamitpObuS
in reply to Hal Pomeranz

Also, how should hidden files be treated?
.foo - as "other"?
.foo.bar - as extension "bar"?
Hal Pomeranz
in reply to lamitpObuS

Stefan Midjich ꙮ҄
in reply to Hal Pomeranz

Are you expecting a one-liner?
Because this almost requires hash tables or associative arrays, which aren't that simple to do in bash, or any shell scripting language.
I'd much rather use python or nodejs for this task.
I'm sure an awk-ninja could do it, but that isn't strictly shell.
paste.centos.org/view/025605f5
Hal Pomeranz
in reply to Stefan Midjich ꙮ҄

Stefan Midjich ꙮ҄
in reply to Hal Pomeranz • • •well the famous geirha on IRC did it for me.
shopt -s nullglob dotglob globstar ; declare -A ext=() ; for file in **/*.* ; do (( ext[".${file##*.}"]++ )) ; done ; for e in "${!ext[@]}" ; do printf '%3d %s\n' "${ext[$e]}" "$e" ; done | sort -rn | head
I could never do that myself.
apgarcia
in reply to Hal Pomeranz

solution:
find . | while read f; do echo ${f##*.}; done | sort | uniq -c | sort -rn
silverwizard
in reply to Hal Pomeranz

find . -type f -exec basename {} \; | grep '\.' | tr A-Z a-z | awk -F "." '{print $NF}' | sort | uniq -c | sort -n ; printf ' ' ; find . -type f -exec basename {} \; | grep -cv '\.' | sed 's/$/ other/'
Let's not talk about the difference between jpg and jpeg, and think about how terrible that looks.
lamitpObuS
in reply to Hal Pomeranz

one line:
find "$count_dir" -type f -printf "%f\n" | while read file; do ext=${file##*.}; [[ "$file" == "$ext" ]] && echo other || echo "$ext"; done | sort -f | uniq -ci | sort -nr
Florian Diesch
in reply to Hal Pomeranz

find . -printf '%f\n' | sed -E -e 's/^[^.]+$/other/' -e 's/.*\.//' | sort -f | uniq -ci | sort -nr
Füsilier Breitlinger
in reply to Hal Pomeranz

The "lc" (lowercase) in the Perl code could be replaced by "sort -f | uniq -ic", but that's both longer and harder to understand, I think.

Remove "-type f" to include directories in the report (most of them will show up as "other", I expect).

The regex in the Perl code could be simplified slightly at the cost of "-printf '%f\n'" in "find", but that's longer.

A version without perl:

But I think the perl version is actually more readable.
Hal Pomeranz
in reply to Hal Pomeranz

I took yesterday's Linux DFIR command line trivia from one of our old Command Line Kung Fu blog postings (Episode 99). It's an interesting challenge, and useful for quickly summarizing the types of files in a directory by simply collecting and counting the file extensions.
I also liked it because we get to use the "${var##pattern}" expansion, similar to my solution from a couple of days ago using "${var%%pattern}". So there's some nice symmetry.
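For readers who haven't used these expansions, a quick illustration of that symmetry (the sample path is made up for the demo):

```shell
# "${var##pattern}" deletes the LONGEST match of pattern from the FRONT;
# "${var%%pattern}" deletes the LONGEST match of pattern from the BACK.
f="archive/report.final.DOCX"
echo "${f##*.}"   # strips everything up to the last "." -> DOCX
echo "${f%%.*}"   # strips everything from the first "." on -> archive/report
```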
If you read the original blog posting, my first solution used a funky "sed" substitution to whittle down to the file extensions. It was actually loyal reader and friend of the blog Jeff Haemer who suggested this much cleaner version (and @barubary who reminded me to quote my variables):
find Documents -type f |
while read f; do
echo "${f##*.}";
done |
sed 's/.*\/.*/other/' | tr A-Z a-z |
sort | uniq -c | sort -n
"find" gets us a list of files in the directory and then we feed that into a loop that uses "${f##*.}" to remove everything up until the final "." in the file path.
Some file names are not going to have a "." and so the original file path will be unchanged. The "sed" expression after the loop marks these files as being in category "other". Finally we shift everything to lowercase so "GIF" and "gif" are recognized as the same type. Then our usual command line histogram idiom rounds things out.
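To see why the "sed" catches the extension-less files, trace two hypothetical paths through the pipeline: a file with an extension loses everything up to its last dot, including the directory part, while a file without one passes through unchanged and still contains a "/", which is exactly what the "sed" pattern keys on:

```shell
# Hypothetical file paths, just to trace the pipeline by hand:
for f in "Documents/photo.JPG" "Documents/README"; do
  echo "${f##*.}"              # -> "JPG" and "Documents/README"
done |
sed 's/.*\/.*/other/' |        # any line still containing "/" had no "." -> "other"
tr A-Z a-z                     # fold case: "JPG" -> "jpg"
```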
We could add some other fix-ups to make things nicer:
find Documents -type f |
while read f; do
echo "${f##*.}";
done |
sed 's/.*\/.*/other/' | tr A-Z a-z |
sed -r 's/^jpg$/jpeg/; s/(~|,v)$//' |
sort | uniq -c | sort -n
I've dropped in another "sed" expression before we start counting things. We make "jpg" and "jpeg" count the same and remove some trailing file extensions for backup files to reduce clutter in the output. This just shows that you can arbitrarily tweak the output to suit your needs.
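That normalization "sed" can be tried on its own with a few made-up extensions (note that "-r" is GNU sed's flag for extended regexes; BSD sed spells it "-E"):

```shell
# Made-up sample extensions: a jpg, a jpeg, an editor backup (~),
# and an RCS file (,v):
printf '%s\n' jpg jpeg 'txt~' 'c,v' |
  sed -r 's/^jpg$/jpeg/; s/(~|,v)$//'
# -> jpeg, jpeg, txt, c
```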
Props to @apgarcia for getting very close to the final solution on this one!
#Linux #DFIR #CommandLine #Trivia
Füsilier Breitlinger
in reply to Hal Pomeranz

OK, this one is broken again. Most obviously, it lists the results in ascending numeric order, not descending. 😉

Names without an extension are indistinguishable in the output from files named "foo.other".

However, it also mishandles file names containing spaces, backslashes, and wildcard characters (a file named "foo.*" will produce funny output). At a minimum you should change the read loop to:

That fixes the issues with spaces, backslashes, and wildcard characters.
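The exact replacement loop isn't quoted above, but the standard hardening for "read" pipelines is the "IFS= read -r" idiom; a sketch under that assumption (the function name is mine, not from the thread):

```shell
# Hedged sketch, not the exact loop from the thread: IFS= preserves
# leading/trailing whitespace, -r stops backslash interpretation, and
# printf '%s\n' avoids echo's quirks with odd file names.
summarize_ext() {
  find "$1" -type f |
  while IFS= read -r f; do
    printf '%s\n' "${f##*.}"
  done |
  sed 's/.*\/.*/other/' | tr A-Z a-z |
  sort | uniq -c | sort -rn
}
```

(File names containing newlines would still confuse any line-oriented pipeline; "find -print0" is the usual escape hatch, at the cost of the simple read loop.)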
However, since all you're doing is removing characters up to the last ".", you could replace the whole loop with just "sed 's/.*\.//'" (or "sed 's/.*\././'" if you want to distinguish between ".other" and "other").

Hal Pomeranz
in reply to Füsilier Breitlinger

@barubary Sigh. It's like you've become my shell programming conscience. "Forgive me, Father, for I have committed the following bash sins..."
My original solution in the blog was using sed, but I liked the loop version for readability.
But I am going to sneak back and edit my answer to quote the variable in the loop. Thank you.
apgarcia
in reply to Hal Pomeranz

I came across a really cool alternative to the 'sort | uniq -c | sort -rn' histogram pattern:
github.com/red-data-tools/YouP…
see screenshot...
GitHub - red-data-tools/YouPlot: a command line tool that draws plots on the terminal.
lamitpObuS
in reply to Hal Pomeranz

@apgarcia @barubary
When I run your solution on a directory with around 19,000 files, it takes around 5 seconds, while mine takes around 1.5 seconds.
I use the case-insensitive options for sort and uniq instead of converting all extensions to lowercase, since it was only required that "abc" and "ABC" be counted as the same extension, not that both be converted to "abc".
My solution also missed the case of files that actually have "other" as their extension.
Otherwise, our results were the same.
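The case-insensitive counting described here can be sketched in isolation (sample extensions are made up; "-f" folds case for sort, "-i" for uniq):

```shell
# uniq -ci groups case-insensitively, so the input must also be sorted
# case-insensitively (sort -f), or lines that are equal ignoring case
# may not end up adjacent and would be counted separately.
printf '%s\n' JPG jpg Jpg pdf |
  sort -f | uniq -ci | sort -nr
```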