How can I remove duplicate filenames from different directories in a list and use them once as xml tag text?

Charlotte · November 22, 2024, 11:16am

I have a file tree like this:

2_Product
    2-1_CategoryName1_Product
        2-1-1_Name1_Product
            LLL_nomenclature1_product.zip
                LLL_nomenclature1_product (folder)
                    notice_nomenclature1.pdf
            LLL_nomenclature1_product_metadata.xml
            LLL_nomenclature2_product.zip
                LLL_nomenclature2_product (folder)
                    notice_nomenclature2.pdf
            LLL_nomenclature2_product_metadata.xml
            LLL_nomenclature3_product.zip
                LLL_nomenclature3_subproduct1 (folder)
                    notice_nomenclature3.pdf
                LLL_nomenclature3_subproduct2 (folder)
                    notice_nomenclature3.pdf
                LLL_nomenclature3_subproduct3 (folder)
                    notice_nomenclature3.pdf
            LLL_nomenclature3_product_metadata.xml
            ... etc
        2-1-2_Name2_Product
        2-1-3_ ...etc
    2-2_CategoryName2_Product
        2-2-1_ ...
        2-2-2_ ...
    ... etc

I have a script that searches my zipped folders for the ‘notice_nomenclatureX.pdf’ files and then adds a tag in the xml of the associated product with the name of the associated notice in it (here ‘notice_nomenclature1.pdf’ for example).

import os
import xml.etree.ElementTree as ET
import zipfile

for root, dirs, files in os.walk("."):
    for folder_ext in files:
        if folder_ext[-4:] == '.zip' and folder_ext[:3] == 'LLL': 
            filePath3 = os.path.join(root, folder_ext)
            zip_folder = zipfile.ZipFile(filePath3)
            zipfile_paths = zip_folder.namelist() 
            for paths in zipfile_paths: 
                zipfiles = os.path.basename(paths)
                if zipfiles[-4:] == '.pdf' and zipfiles[:3] == 'not':
                    notice_name = zipfiles 
                    for prdt in files: 
                        if prdt[-4:] == '.xml' and prdt[:-13] == folder_ext[:-4] : 
                            filePath4 = os.path.join(root, prdt) 
                            xml_produit = ET.parse(filePath4) 
                            root_produit = xml_produit.getroot() 
                            notice_tag = ET.SubElement(root_produit, "notice_pdf") 
                            notice_tag.text = notice_name 
                            ET.indent(root_produit) 
                            xml_produit.write(filePath4, encoding='utf-8', xml_declaration=True, method='xml', short_empty_elements=False)

My script works well for ‘nomenclature1’ and ‘nomenclature2’ and produces this in my xml (which is what I want):

<?xml version="1.0" encoding="UTF-8"?>
<gmd:MD_Metadata xmlns:gmd="http:...">
.
.
.
<notice_pdf>notice_nomenclature1.pdf</notice_pdf>
</gmd:MD_Metadata>

But for ‘nomenclature3’, I get this (which is not what I want):

<?xml version="1.0" encoding="UTF-8"?>
<gmd:MD_Metadata xmlns:gmd="http:...">
.
.
.
<notice_pdf>notice_nomenclature3.pdf</notice_pdf>
<notice_pdf>notice_nomenclature3.pdf</notice_pdf>
<notice_pdf>notice_nomenclature3.pdf</notice_pdf>
</gmd:MD_Metadata>

How do I tell my script that when the same notice name turns up several times in ‘zipfiles’, it should write it to the xml tag only once?

I’ve tried using .sort() and sorted, to no avail.

And I tried this :

...
new_list = []
for paths in zipfile_paths:
    zipfiles = os.path.basename(paths)
    if zipfiles[-4:] == '.pdf' and zipfiles[:3] == 'not': 
        if zipfiles not in new_list:
            new_list.append(zipfiles)
            notice_name = new_list
... etc

“notice_nomenclature3.pdf” appears only once in “new_list” but when I run the script, it has a problem with the list format and it returns the following error :

TypeError: write() argument must be str, not list

Do you know how I can achieve the desired result?
Thank you.

Maja · November 22, 2024, 11:40am

To ensure that duplicate entries like notice_nomenclature3.pdf are added only once to your XML, you need to maintain a record of processed notice names per XML file. Here’s how you can modify your script:

Key Changes:

Use a set to track the notice names that have already been added to the XML file.
Ensure the notice_name variable holds a string, not a list.

Updated Script:

import os
import xml.etree.ElementTree as ET
import zipfile

for root, dirs, files in os.walk("."):
    for folder_ext in files:
        if folder_ext.endswith('.zip') and folder_ext.startswith('LLL'):
            filePath3 = os.path.join(root, folder_ext)
            # Open the archive in a with-block so its file handle is closed afterwards
            with zipfile.ZipFile(filePath3) as zip_folder:
                zipfile_paths = zip_folder.namelist()

            # Use a set to track notices already handled for this archive
            processed_notices = set()

            for paths in zipfile_paths:
                zipfiles = os.path.basename(paths)

                if zipfiles.endswith('.pdf') and zipfiles.startswith('not'):
                    notice_name = zipfiles  # a single string, not a list

                    if notice_name not in processed_notices:  # Check if already processed
                        processed_notices.add(notice_name)  # Add to the set

                        for prdt in files:
                            # '_metadata.xml' is 13 characters long, '.zip' is 4
                            if prdt.endswith('.xml') and prdt[:-13] == folder_ext[:-4]:
                                filePath4 = os.path.join(root, prdt)

                                xml_produit = ET.parse(filePath4)
                                root_produit = xml_produit.getroot()

                                # Check if the notice_pdf tag already exists
                                existing_notices = [tag.text for tag in root_produit.findall('notice_pdf')]
                                if notice_name not in existing_notices:
                                    notice_tag = ET.SubElement(root_produit, "notice_pdf")
                                    notice_tag.text = notice_name

                                # Write back to the XML file (ET.indent requires Python 3.9+)
                                ET.indent(root_produit)
                                xml_produit.write(filePath4, encoding='utf-8', xml_declaration=True, method='xml', short_empty_elements=False)

Explanation of Changes:

Tracking Processed Notices with set:

The processed_notices set ensures that each notice_name is handled only once during the loop over zipfile_paths.
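If it helps to see the deduplication step on its own, here is a minimal sketch. The helper name unique_notice_names and the in-memory test archive are my own inventions for illustration, and the filter uses the slightly stricter prefix 'notice_' instead of 'not'. It relies on dict.fromkeys, which keeps the first occurrence of each name and preserves order:

```python
import io
import os
import zipfile


def unique_notice_names(zip_source):
    """Return notice PDF basenames found in a zip, each listed once, in first-seen order."""
    with zipfile.ZipFile(zip_source) as archive:
        basenames = [os.path.basename(path) for path in archive.namelist()]
    notices = [name for name in basenames
               if name.startswith("notice_") and name.endswith(".pdf")]
    # dict.fromkeys drops repeated keys while keeping the original order
    return list(dict.fromkeys(notices))


# Build a small in-memory archive that mimics LLL_nomenclature3_product.zip
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for sub in ("subproduct1", "subproduct2", "subproduct3"):
        archive.writestr(f"LLL_nomenclature3_{sub}/notice_nomenclature3.pdf", b"")
buffer.seek(0)

print(unique_notice_names(buffer))  # ['notice_nomenclature3.pdf']
```

With the names collected this way, the XML only needs to be opened once per archive instead of once per matching path.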

Prevent Duplicate Tags in XML:

Before adding a new notice_pdf tag, the script checks the existing notice_pdf tags in the XML file (existing_notices) to avoid duplicates. This also makes it safe to run the script again on files that were already processed.
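The same check can be pulled out into a small function, which makes it easy to try without touching real files. This is only a sketch: add_notice_once is a hypothetical helper name, and the tiny XML string stands in for a real metadata file.

```python
import xml.etree.ElementTree as ET


def add_notice_once(root, notice_name):
    """Append <notice_pdf> with notice_name unless an identical tag exists.

    Returns True when a tag was added, False when it was already present.
    """
    existing = {tag.text for tag in root.findall("notice_pdf")}
    if notice_name in existing:
        return False
    ET.SubElement(root, "notice_pdf").text = notice_name
    return True


# Hypothetical stand-in for a product metadata file
root = ET.fromstring("<MD_Metadata><title>demo</title></MD_Metadata>")

first = add_notice_once(root, "notice_nomenclature3.pdf")
second = add_notice_once(root, "notice_nomenclature3.pdf")
print(first, second, len(root.findall("notice_pdf")))  # True False 1
```

The second call finds the tag written by the first one and does nothing, which is the behaviour you want when the script is run more than once.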

Error Fix (TypeError):

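In your attempt, notice_name was assigned new_list, so notice_tag.text received a whole list, and ElementTree can only write strings. In the updated script the notice_name variable always holds a single string, not a list. This resolves the TypeError: write() argument must be str, not list.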
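To illustrate where the error comes from, here is a small reproduction using an in-memory element instead of your files (the MD_Metadata root without namespace is a simplification of mine). As far as I can tell, ElementTree accepts the list when it is assigned and only fails once the tree is serialized:

```python
import xml.etree.ElementTree as ET

root = ET.Element("MD_Metadata")
new_list = ["notice_nomenclature3.pdf"]

# Assigning the list itself is accepted without complaint...
bad = ET.SubElement(root, "notice_pdf")
bad.text = new_list
try:
    ET.tostring(root, encoding="unicode")
    failed = False
except TypeError:
    failed = True  # ...but serialization rejects text that is not a string

# Fix: create one tag per string taken from the list
root.remove(bad)
for name in new_list:
    ET.SubElement(root, "notice_pdf").text = name

xml_text = ET.tostring(root, encoding="unicode")
print(failed)
print(xml_text)
```

So keeping a list of unique names is fine, as long as each tag's text is set to one element of the list and not to the list itself.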

Example Output:

For nomenclature3, whose zip contains three identical notice_nomenclature3.pdf files, the XML will now have:

<?xml version="1.0" encoding="UTF-8"?>
<gmd:MD_Metadata xmlns:gmd="http:...">
    ...
    <notice_pdf>notice_nomenclature3.pdf</notice_pdf>
</gmd:MD_Metadata>

This approach ensures the XML file remains clean and doesn’t have duplicate tags for the same notice. Let me know if you encounter any issues!